August 04, 2019 · 9 mins read

Lesson 3 - Self Driving Cars

Deep learning enables us to do things that could never be done before. It is able to do things that were science fiction only 20 or even 10 years ago. One of those things is having cars drive themselves. If I were old enough to drive a car, I would probably agree that driving is annoying and I’d rather have my car drive itself. Before this is a reality, there are a number of problems that need to be solved. One of those is computer vision. It is very hard for computers to understand the environment they are in. Luckily, this is an active area of research where some of the best machine learning engineers have made huge advances in the last couple of years.

By classifying objects in the image, computers can have a better sense of their surroundings. This is much like how humans behave: we don’t look at what clothes someone is wearing or what color a car is; we just care about the presence and location of the object. A commonly used technique to do this kind of classification is image segmentation. On a high level it assigns each pixel of an image to a certain category, like cars, humans or lorries in the case of self driving.

This project is a part of my fast ai journey. You can read all the posts published so far here.

You should download the notebook from this GitHub repository in order to follow along.

The final result of the analyzation:


While convolutional neural networks compress a lot of data into a simple prediction, U-Nets output whole new images. By first mimicking CNNs, they are able to compress a lot of data into a small size. The output of this first part is a list of objects that are present in the image. Then using the original image, it assigns the each of the pixels of the original image to a class to finally return a clear representation of the real world.

A schematic of U-Nets:

This schematic perfectly clarifies the name of U-Nets.

Getting the data

The dataset I used is the CamVid dataset by Cambridge University. This is one of the best publicly available datasets of segmented driving. It happens to be integrated into fastai as well. Therefore, it’s super easy to download the data:

path = untar_data(URLs.CAMVID)

Now load the names for the images and labels like so:

path_lbl = path/'labels'
path_img = path/'images'
fnames = get_image_files(path_img)
lbl_names = get_image_files(path_lbl)

The objects used in the CamDev dataset are defined in codes.txt. You can load those using numpy.

codes = np.loadtxt(path/'codes.txt', dtype=str)

I encourage you to take a look at the codes so you know what objects the model will be learning to recongnize.

The DataBunch

Remember that fastai needs the data to be wrapped inside a DataBunch object. Because our dataset consists of two parts, video frames and masked images, the easiest way to go around creating a DataBunch is by having a function that links the frame with the masked image: image_to_mask.

image_to_mask = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'

A lambda function is a function that can be stored within a variable. The reason I’m using a lambda function is that we will pass this function onto fastai.

We need to know two other things before we are ready to create a DataBunch. Namely,

  1. The batch size, and;
  2. The frame size.

In order to boost learning speed, we set the frame size to half the size of the frames in the dataset:

src_size = np.array(open_mask(image_to_mask(fnames[0])).shape[1:])
size = src_size // 2

As we’re looping over every single pixel of every image in the dataset for every epoch, I used a small batch size of 4. If you have a very high end machine, you can also use 8.

bs = 4

Finally, create the labels using the image_to_mask function:

src = (SegmentationItemList.from_folder(path_img)
       .label_from_func(image_to_mask, classes=codes))

The final step of the data preparation is extracting the identifiers for the objects:

ids = {v:k for k,v in enumerate(codes)}

Visualising the data

Images in the CamVid dataset look like this:

We can also visualize the desired result. Note that this is just a representation of what the classification; the colors represent objects we are trying to classify. Those objects are represented by indexes in the training phase.

Combining the classification and video frame gives so very neat results.

Training the U-Net

Because we are doing a custom type of classification, we need to write our own cost function for the task. I measure accuracy by taking the percentage of non-void (not identified) objects that the model got correct. (void_code is the id of void pixels.)

void_code = ids['Void']

def acc_camvid(input, target):
    target = target.squeeze(1)
    mask = target != void_code
    return (input.argmax(dim=1)[mask]==target[mask]).float().mean()

Now we create the learn object. Similar to the previous project, I used resnet34 and transfer learning for this project as well.

learn = unet_learner(data, models.resnet34, metrics=acc_camvid)

By plotting the loss for a range of learning rates, we can decide which learning rate we want to use to train the model.


As you can see, a learning rate slightly higher than 1e-03 would be good.

learn.fit_one_cycle(10, slice(lr), pct_start=0.9)'stage-1')

After this first stage completes, we can inspect the results of the model:

The results are looking good. However, we can do even better by unfreezing and further training the model.

lrs = slice(lr/400, lr/4)

learn.fit_one_cycle(12, lrs)'stage-2');

Note that we get an accuracy of about spending less than $1 on training while the researchers got only a little more than 10 years ago. The rate of improvement is incredible.

Predicting while driving

While being able to analayze images is great, we really want to be able to do this on a stream. Because we don’t have a stream of data available, we can use a subsequent set of images out of the test set.

You can write a simple for loop that loops over the training images you want to use in your video. (os.listdir). You should open the image in a format that fastai can understand. The best way to do this is using open_image.

img = open_image('path/to/image.png')

Then get the predictions for the image.

pred_class, _, _ = learn.predict(img)

The classification is stored in pred_class. If you were to build self driving software, you would use this data. For demonstration purposes, I decided to show an image for all the frames:


Concatenating all these images, we get this beautiful mapping of the world:


U-Nets are a great neural architecture for doing image segmentation. On modern hardware, this is so performant that we can outperform the state of the art technique of roughly ten years ago by a lot.

Image segmentation is identifying objects in images. One application of this is self driving cars. However, it can be used for a lot of other things as well.


(1) Segmentation and Recognition Using Structure from Motion Point Clouds, ECCV 2008 Brostow, Shotton, Fauqueur, Cipolla

(2) Semantic Object Classes in Video: A High-Definition Ground Truth Database Pattern Recognition Letters (to appear) Brostow, Fauqueur, Cipolla

(3) Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol.9351: 234–241, 2015, available at arXiv:1505.04597 [cs.CV].

A huge thanks to Sam Miserendino for proofreading this post!