December 06, 2019

An introduction to Generative Adversarial Networks (in Swift for TensorFlow)

Generative adversarial networks, or GANs, are one of the most interesting ideas in deep learning. With GANs, computers get a sense of imagination: they can create their own “things”. But how do they do that? It’s easy to have a computer generate random data, but that is of little value to us humans. How does a computer generate something that looks like the items in a dataset?

Those are the questions Ian Goodfellow et al. asked while writing their paper “Generative Adversarial Networks” (arxiv). They came up with something extremely clever: they wondered whether the computer would be able to classify something as belonging to the dataset or not. That something would sometimes be an example from the dataset, and sometimes something the computer generated. They did this using a deep neural network, called the discriminator.

But then how do you get a computer to generate something? You start with a tensor filled with random numbers, because you wouldn’t want each generated image to look too similar. This tensor is then fed into a neural network, the generator, which generates an image from it.

So how does the generator generate realistic-looking images? Well, at first it does not, as we will see later on in this post. To understand how it acquires that capability, we have to look at the training loop of a GAN.

During one epoch, both the discriminator and the generator are trained. During the first epoch, both neural networks perform badly: the discriminator doesn’t know what a realistic image looks like, and the generator produces random things. As the generator keeps learning, however, it starts to generate images that are realistic enough that the discriminator has a hard time telling them apart from real ones. So we train the discriminator a bit more until it does mark those images as fake. Then it’s the generator’s task to improve and make the discriminator fail once again. And so on and so on.

To recap:

  • The generator, a neural network, generates images from random data.
  • The discriminator, another neural network, tries to tell those images apart from “real” images.

During training, both improve.
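For reference, the original paper expresses this tug-of-war as a two-player minimax game. In the formula below, \(D(x)\) is the probability the discriminator assigns to \(x\) being real, \(G(z)\) is the image generated from noise \(z\), and \(p_{\text{data}}\) and \(p_z\) are the data and noise distributions:

```latex
\[
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr] +
  \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\]
```

The discriminator tries to make \(V\) as large as possible, while the generator tries to make it as small as possible.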

If you’ve never worked with Swift for TensorFlow before, I highly recommend you check out my previous post “Your first Swift for TensorFlow model”.

Deep Convolutional Generative Adversarial Networks

It’s not just a set of buzzwords.

DCGANs (arxiv) are a particular type of GAN that uses transposed convolutional layers, sometimes called “deconvolutional layers”. The mathematics of these operations is beyond the scope of this post, but here’s a Wikipedia page if you’re interested in learning more. For now, just remember that transposed convolutional layers do the opposite of convolutional layers: they create images from shapes instead of filtering out shapes in images.
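To get a feel for how a transposed convolution turns a small input into a larger output, here is a minimal sketch in plain Swift (no TensorFlow needed) of the one-dimensional case. The function name transposedConv1D and the toy values are assumptions made up for illustration; they are not part of the Swift for TensorFlow API, and the sketch uses no padding.

```swift
// Hypothetical helper for illustration only; not part of Swift for TensorFlow.
// Each input value "stamps" a scaled copy of the kernel into the output,
// `stride` positions apart, so the output is longer than the input.
func transposedConv1D(_ input: [Double], kernel: [Double], stride: Int) -> [Double] {
    precondition(!input.isEmpty && !kernel.isEmpty && stride > 0)
    let outputCount = (input.count - 1) * stride + kernel.count
    var output = [Double](repeating: 0, count: outputCount)
    for (i, value) in input.enumerated() {
        for (k, weight) in kernel.enumerated() {
            output[i * stride + k] += value * weight
        }
    }
    return output
}

// Three values in, six values out: the input has been upsampled.
let upsampled = transposedConv1D([1, 2, 3], kernel: [1, 1], stride: 2)
print(upsampled) // [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

A real layer learns the kernel weights during training and works in two dimensions, but the idea of spreading each input value over a larger output is the same.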

With these layers you can do fun things. For example, you can generate handwritten digits from a random \(z \in \mathbb{R}^{100}\) vector.

Here’s the generator architecture the researchers used when developing DCGANs (source: arxiv)


Let’s build one

The best way to really understand something is often to build it yourself. So let’s see how to build a DCGAN to generate handwritten digits (MNIST).

We start by downloading and importing dependencies (learn more here):

%install-location $cwd/swift-install
%install '.package(url: "", .branch("tensorflow-0.6"))' Datasets
import TensorFlow
import Foundation
import Datasets
%include "EnableIPythonDisplay.swift"
IPythonDisplay.shell.enable_matplotlib("inline")
import Python
let plt = Python.import("matplotlib.pyplot")
let np = Python.import("numpy")

Now we can load MNIST:

let batchSize = 512
let mnist = MNIST(batchSize: batchSize, flattening: false, normalizing: true)

In Swift you can declare a transposed convolutional layer as follows. Note that the input and output dimensions are flipped around compared to a normal convolutional layer.

TransposedConv2D<Float>(
    filterShape: (5, 5, 128, 256),
    strides: (1, 1),
    padding: .same
)

Building a generator

Recall that the generator creates images from random data (\(z \in \mathbb{R}^{100}\)) using transposed convolutional layers. We start by specifying this dimension (it’s called \(z\) in the paper), which we will use when building the model:

let zDim = 100

Then we will build the model. This architecture is based off of the architecture used in the paper. The key difference between our version and the one proposed in the paper is that our output is \(\mathbb{R}^{28 \times 28}\) where the paper uses \(\mathbb{R}^{64 \times 64 \times 3}\) images. To achieve this result, the paper scales up the input by doubling the width and height of the tensor. Where the paper starts with \(4 \times 4\) and scales up to \(64 \times 64\), we start with \(7 \times 7\) and scale up to \(28 \times 28\). Upscaling is done, of course, using the transposed convolutional layers.
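The upscaling arithmetic can be checked with a few lines of plain Swift. This is only a sketch: the helper name transposedSameOutputSize is made up for illustration, and it assumes that with “same” padding a transposed convolution multiplies the width and height by the stride. The strides are the ones used by the three layers in the generator in this post.

```swift
// Hypothetical helper: spatial size after a transposed convolution with
// "same" padding, assumed to be the input size times the stride.
func transposedSameOutputSize(inputSize: Int, stride: Int) -> Int {
    return inputSize * stride
}

// Strides of the three transposed convolutional layers used in this post.
let layerStrides = [1, 2, 2]
var size = 7
var sizes = [size]
for layerStride in layerStrides {
    size = transposedSameOutputSize(inputSize: size, stride: layerStride)
    sizes.append(size)
}
print(sizes) // [7, 7, 14, 28]
```

The first layer keeps the \(7 \times 7\) size and only changes the number of channels; the two layers with stride 2 do the actual doubling.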

BatchNorm layers expect the data to be flattened, for now, so we flatten the data before passing it to a BatchNorm layer and reshape it again afterwards. Because our model should work when we test it on one image as well as on batches, we can only reshape once we know the final size, which might vary. It’s hard to think about the fourth dimension we are using here, so instead think about a stack of 3-dimensional tensors.

Note that we can reuse the flatten layer multiple times, both at inference and training time, because it has no parameters. I’ve added comments to clarify the feedforward step.

struct Generator: Layer {
    var flatten = Flatten<Float>()

    var dense1 = Dense<Float>(inputSize: zDim, outputSize: 7 * 7 * 256)
    var batchNorm1 = BatchNorm<Float>(featureCount: 7 * 7 * 256)
    // leaky relu
    // reshape

    var transConv2D1 = TransposedConv2D<Float>(
        filterShape: (5, 5, 128, 256),
        strides: (1, 1),
        padding: .same
    )
    // flatten
    var batchNorm2 = BatchNorm<Float>(featureCount: 7 * 7 * 128)
    // leaky relu

    var transConv2D2 = TransposedConv2D<Float>(
        filterShape: (5, 5, 64, 128),
        strides: (2, 2),
        padding: .same
    )
    // flatten
    var batchNorm3 = BatchNorm<Float>(featureCount: 14 * 14 * 64)
    // leaky relu

    var transConv2D3 = TransposedConv2D<Float>(
        filterShape: (5, 5, 1, 64),
        strides: (2, 2),
        padding: .same
    )
    // tanh

    public func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        // dense -> batch norm -> leaky relu, then reshape to (batch, 7, 7, 256)
        let x1 = leakyRelu(input.sequenced(through: dense1, batchNorm1))
        let x1Reshape = x1.reshaped(to: TensorShape(x1.shape.contiguousSize / (7 * 7 * 256), 7, 7, 256))
        // transposed conv -> flatten -> batch norm -> leaky relu, then reshape to (batch, 7, 7, 128)
        let x2 = leakyRelu(x1Reshape.sequenced(through: transConv2D1, flatten, batchNorm2))
        let x2Reshape = x2.reshaped(to: TensorShape(x2.shape.contiguousSize / (7 * 7 * 128), 7, 7, 128))
        // transposed conv -> flatten -> batch norm -> leaky relu, then reshape to (batch, 14, 14, 64)
        let x3 = leakyRelu(x2Reshape.sequenced(through: transConv2D2, flatten, batchNorm3))
        let x3Reshape = x3.reshaped(to: TensorShape(x3.shape.contiguousSize / (14 * 14 * 64), 14, 14, 64))
        // final transposed conv -> tanh, giving (batch, 28, 28, 1)
        return tanh(transConv2D3(x3Reshape))
    }
}

After building a network it is always a good idea to check and see if it actually works.

var generator = Generator()

Now create a noise vector, \(z\):

let noise = Tensor<Float>(randomNormal: TensorShape(1, zDim))
let generatedImage = generator(noise)

Inspecting generatedImage.shape should return something like this:

▿ [1, 28, 28, 1]
  ▿ dimensions : 4 elements
    - 0 : 1
    - 1 : 28
    - 2 : 28
    - 3 : 1

Great–that’s the shape real MNIST images are in!

Let’s have a look at this image:

plt.imshow(generatedImage.reshaped(to: TensorShape(28, 28)).makeNumpyArray())

output of GAN with no training

Uhh that doesn’t look like a digit? Let’s have the computer draw the same conclusion.

Your image might look slightly different since we are working with a random initialization.

Building a discriminator

Recall that the task of the discriminator is to classify images as “real” or “fake.” This is a type of binary classification, so the neural net has only one output; which digit a particular image shows doesn’t matter. The output will be \(0\) when the image is fake and \(1\) when it’s real (hopefully). Strictly speaking, the network returns a raw score (a logit), and the sigmoid inside the loss function later maps it to a value between \(0\) and \(1\).

This network should be fairly straightforward:

struct Discriminator: Layer {
      var conv2D1 = Conv2D<Float>(
          filterShape: (5, 5, 1, 64),
          strides: (2, 2),
          padding: .same
      )
      // leaky relu
      var dropout = Dropout<Float>(probability: 0.3)

      var conv2D2 = Conv2D<Float>(
          filterShape: (5, 5, 64, 128),
          strides: (2, 2),
          padding: .same
      )
      // leaky relu
      // dropout

      var flatten = Flatten<Float>()
      var dense = Dense<Float>(inputSize: 6272, outputSize: 1)

      public func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
          let x1 = dropout(leakyRelu(conv2D1(input)))
          let x2 = dropout(leakyRelu(conv2D2(x1)))
          return x2.sequenced(through: flatten, dense)
      }
}
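The inputSize of 6272 for the dense layer follows from the shapes of the two convolutions. Here is a small sketch of that arithmetic in plain Swift; the helper name convSameOutputSize is made up for illustration, and it assumes that a convolution with “same” padding produces an output of the input size divided by the stride, rounded up.

```swift
// Hypothetical helper: spatial size after a convolution with "same" padding,
// assumed to be the input size divided by the stride, rounded up.
func convSameOutputSize(inputSize: Int, stride: Int) -> Int {
    return (inputSize + stride - 1) / stride
}

let afterConv1 = convSameOutputSize(inputSize: 28, stride: 2)         // 14
let afterConv2 = convSameOutputSize(inputSize: afterConv1, stride: 2) // 7
let channels = 128 // output channels of the second convolution
let flattenedSize = afterConv2 * afterConv2 * channels
print(flattenedSize) // 6272
```

If you change the strides, filter counts or image size, this is the number you would need to update in the dense layer.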

Let’s do a quick test once more. What would the discriminator think about our generated image?

var discriminator = Discriminator()
discriminator(generatedImage)

It should return a tensor containing a single number.


Again, this number might be different for you. As long as it is a scalar you’re good.


There’s one more thing before we can start building the training loop and that’s a loss function.

Generator loss

When we train the generator, we want its output to be as realistic as possible. We determine how realistic the images are using the discriminator. This is the only way we can measure the generator’s performance. We penalize the model for creating bad images (according to the discriminator).

In code, it would look something like this:

func generatorLoss(fakeLabels: Tensor<Float>) -> Tensor<Float> {
    sigmoidCrossEntropy(logits: fakeLabels,
                        labels: Tensor(ones: fakeLabels.shape))
}

The fakeLabels will be generated by the discriminator later on in the training loop. So we use the discriminator when evaluating the generator!
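To see why this loss rewards fooling the discriminator, here is a minimal sketch in plain Swift of sigmoid cross-entropy for a single score. It is a scalar stand-in written for illustration, not the library function itself, and the name toyGeneratorLoss and the example scores are assumptions.

```swift
import Foundation

// Numerically stable sigmoid cross-entropy for one logit x and one label z:
// max(x, 0) - x * z + log(1 + exp(-|x|))
func scalarSigmoidCrossEntropy(logit x: Double, label z: Double) -> Double {
    return max(x, 0) - x * z + log1p(exp(-abs(x)))
}

// Generator loss for a single image: the target label is 1 ("real").
func toyGeneratorLoss(discriminatorLogit: Double) -> Double {
    return scalarSigmoidCrossEntropy(logit: discriminatorLogit, label: 1)
}

let fooled = toyGeneratorLoss(discriminatorLogit: 4.0)  // discriminator leans "real"
let caught = toyGeneratorLoss(discriminatorLogit: -4.0) // discriminator leans "fake"
print(fooled < caught) // true
```

The loss is small when the discriminator scores a generated image as real and large when it spots the fake, which is exactly the pressure we want on the generator.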

Discriminator loss

We need the generator to determine the loss for the discriminator, too. Training a GAN really is a strongly coupled process.

The discriminator is penalized for not recognizing the generator’s images as fake and for not recognizing the training images as real, so there are two summands in this equation.

func discriminatorLoss(realLabels: Tensor<Float>, fakeLabels: Tensor<Float>) -> Tensor<Float> {
    let realLoss = sigmoidCrossEntropy(logits: realLabels,
                                       labels: Tensor(ones: realLabels.shape)) // should say it's real, 1
    let fakeLoss = sigmoidCrossEntropy(logits: fakeLabels,
                                       labels: Tensor(zeros: fakeLabels.shape)) // should say it's fake, 0
    return realLoss + fakeLoss // cumulative
}
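Written out for a single real image \(x\) and a single noise vector \(z\), and writing \(D(\cdot)\) for the sigmoid of the discriminator’s raw output, the two loss functions in this post come down to the following (the code averages these terms over a batch):

```latex
\[
L_D = -\log D(x) - \log\bigl(1 - D(G(z))\bigr),
\qquad
L_G = -\log D(G(z))
\]
```

The first term of \(L_D\) is the realLoss and the second is the fakeLoss. Note that \(L_G\) asks the generator to maximize \(\log D(G(z))\) rather than to minimize \(\log(1 - D(G(z)))\) directly; this is commonly called the non-saturating generator loss.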


Since modern optimizers use historical data when applying gradients to a model, we need two separate optimizers to train a GAN.

let optG = Adam(for: generator, learningRate: 0.0001)
let optD = Adam(for: discriminator, learningRate: 0.0001)
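To illustrate what “historical data” means here, the following is a toy momentum optimizer for a single number, in plain Swift. It is a deliberately simplified stand-in, not how Adam is implemented; the type ToyMomentum and all values are assumptions for illustration. The point is only that each optimizer carries its own running state, which would get mixed up if the generator and discriminator shared one.

```swift
// A toy momentum optimizer for one scalar parameter (illustration only).
struct ToyMomentum {
    let learningRate: Double
    let decay: Double
    var velocity: Double = 0 // running average of past gradients: the "history"

    init(learningRate: Double, decay: Double) {
        self.learningRate = learningRate
        self.decay = decay
    }

    mutating func update(_ parameter: inout Double, along gradient: Double) {
        velocity = decay * velocity + (1 - decay) * gradient
        parameter -= learningRate * velocity
    }
}

var generatorWeight = 1.0
var discriminatorWeight = 1.0
var toyOptG = ToyMomentum(learningRate: 0.1, decay: 0.9)
var toyOptD = ToyMomentum(learningRate: 0.1, decay: 0.9)

// Opposite gradients: each optimizer builds up its own, different history.
toyOptG.update(&generatorWeight, along: 2.0)
toyOptD.update(&discriminatorWeight, along: -2.0)
print(toyOptG.velocity, toyOptD.velocity)
```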

The training loop

This is what our training loop will look like:

for epoch in 0...epochs {
    Context.local.learningPhase = .training
    for i in 0..<(mnist.trainingSize / batchSize) + 1 {
        let realImages = mnist.trainingImages.minibatch(at: i, batchSize: i * batchSize >= mnist.trainingSize ? (mnist.trainingSize - ((i - 1) * batchSize)) : batchSize)

        // todo: train generator

        // todo: train discriminator
    }

    // todo: render samples
    Context.local.learningPhase = .inference
}
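The inner loop walks over the training set one minibatch at a time, and the last batch is smaller when the batch size does not divide the dataset evenly. As a sketch of that bookkeeping in plain Swift (the helper minibatchRanges is hypothetical and not part of the Datasets package; 60,000 is the size of the MNIST training set):

```swift
// Hypothetical helper: index ranges that split `count` examples into batches
// of `batchSize`, with a smaller final batch when the sizes don't divide evenly.
func minibatchRanges(count: Int, batchSize: Int) -> [Range<Int>] {
    precondition(count >= 0 && batchSize > 0)
    return stride(from: 0, to: count, by: batchSize).map { start in
        start..<min(start + batchSize, count)
    }
}

let ranges = minibatchRanges(count: 60_000, batchSize: 512)
print(ranges.count) // 118
print(ranges.last!) // 59904..<60000
```

With a batch size of 512 there are 117 full batches and one final batch of 96 images.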

Let’s fill in the blanks.

We will train the discriminator and generator on random data, so it doesn’t overfit. But we will use our existing noise vector at inference time, so we can track the progress.

First we will train the generator with the .gradient API. Replace // todo: train generator with:

let noiseG = Tensor<Float>(randomNormal: TensorShape(batchSize, zDim))
let 𝛁generator = generator.gradient { generator -> Tensor<Float> in
    let fakeImages = generator(noiseG)
    let fakeLabels = discriminator(fakeImages)
    let loss = generatorLoss(fakeLabels: fakeLabels)
    return loss
}
optG.update(&generator, along: 𝛁generator)

Same thing for // todo: train discriminator:

let noiseD = Tensor<Float>(randomNormal: TensorShape(batchSize, zDim))
let fakeImages = generator(noiseD)

let 𝛁discriminator = discriminator.gradient { discriminator -> Tensor<Float> in
    let realLabels = discriminator(realImages)
    let fakeLabels = discriminator(fakeImages)
    let loss = discriminatorLoss(realLabels: realLabels, fakeLabels: fakeLabels)
    return loss
}
optD.update(&discriminator, along: 𝛁discriminator)

Now we can inspect how our model is doing.

Remember to first set the learning phase to inference, so we don’t apply dropout!

Context.local.learningPhase = .inference

Then we will render the same image we did when building the generator. It should have improved:

let generatedImage = generator(noise)
plt.imshow(generatedImage.reshaped(to: TensorShape(28, 28)).makeNumpyArray())

Then we print out the loss:

let generatorLoss_ = generatorLoss(fakeLabels: discriminator(generatedImage)) // the loss expects the discriminator's output, not the image itself
print("epoch: \(epoch) | Generator loss: \(generatorLoss_)")

And again! And again! I trained the GAN for 20 epochs, but feel free to play around and change that setting.


You can create more images using the following code:

let noise = Tensor<Float>(randomNormal: TensorShape(1, 100))
let generatedImage = generator(noise)
plt.imshow(generatedImage.reshaped(to: TensorShape(28, 28)).makeNumpyArray())

Full code

You can view the notebook in my s4tf-notebooks repo. Don’t forget to star it if you haven’t already done so. You’ll also find a link to the Colab there.

A huge thanks to Ayush Agrawal for his great advice on this project!

Thanks for reading

I hope this post was useful to you. If you have any questions or comments, feel free to reach out on Twitter or email me directly at rick_wierenga [at] icloud [dot] com.