What breed is my dog?

Automatic dog breed detection using CNN & transfer learning

Published in Nerd For Tech

7 min read · May 21, 2021

I’ve always thought my dog was a Bichon Frise until last week, when some friends questioned its breed: while most Bichon Frises are white, my dog’s coat is apricot-coloured. But is colour really a key sign of breed? Can’t an apricot dog still be a Bichon Frise?

Luckily, classifying dogs by breed is a task that deep learning models handle well. While the exercise can be challenging even for humans (and the focus of many heated debates), in this project we show that Convolutional Neural Networks with Transfer Learning can achieve a test accuracy greater than 80% on this task.

First things first: what do we want to achieve? Problem Statement

The goal of this project is to accurately predict a dog’s breed from a picture (more specifically, from a picture of my dog). This is a challenging task that can be difficult even for humans. The key difficulties are the following:

1) Minimal inter-class variation (similarities between different dog breeds), for example in terms of colour and size. Consider a Brittany and a Welsh Springer Spaniel:

Inter-class variation: Brittany dog breed (left) vs Welsh Springer Spaniel (right)

2) Intra-class variation: for example, consider different types of Labrador with different colours or sizes (yellow, chocolate, black…).

How will we solve the problem? Analytics approach

Given a picture of a dog or a human, we first want to identify whether the image shows a dog or a human. Next, if it’s a dog, we want to accurately predict its breed. If it’s a human, we’ll have some fun and show the user the dog breed they most resemble!

If our algorithm detects neither a dog nor a human in the picture, we return an error message.
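The branching described above can be sketched in a few lines. This is only an illustration of the control flow: the function name describe_image, the three callables and the message wording are hypothetical, and the real detectors and breed classifier would be passed in as arguments.

```python
from typing import Callable


def describe_image(
    img_path: str,
    is_dog: Callable[[str], bool],
    is_human: Callable[[str], bool],
    predict_breed: Callable[[str], str],
) -> str:
    """Return a message that depends on what the detectors find in the image."""
    if is_dog(img_path):
        return f"This dog looks like a {predict_breed(img_path)}."
    if is_human(img_path):
        return f"This human resembles a {predict_breed(img_path)}."
    # Neither detector fired: report an error instead of guessing a breed.
    return "Error: no dog or human detected in the image."


# Stub detectors (hypothetical) stand in for the real models.
message = describe_image(
    "my_dog.jpg",
    is_dog=lambda path: True,
    is_human=lambda path: False,
    predict_breed=lambda path: "Bichon Frise",
)
```

Passing the detectors in as functions keeps the decision logic independent of any particular model, so it can be checked with stubs before the networks are trained.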

To solve the problem defined, we take the following steps:

Data Exploration: loading the datasets and understanding the training dataset
Build Human Detector using Haar cascades for face detection
Build Dog Detector using pre-trained ResNet50 model
Understand the limitations of a CNN model created from scratch
Use Transfer Learning and fine-tune a pre-trained ResNet50 model to predict the most likely breed for dogs and the most closely resembling breed for humans
Build the algorithm to return different predictions based on the content of the input image
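As a rough sketch of the dog detector step, one common approach is to check whether the top class predicted by an ImageNet-trained ResNet50 falls inside the range of dog categories. The code below assumes that the model's 1000 class probabilities are already available as a sequence, and that dog breeds occupy ImageNet indices 151 to 268 inclusive; the function names are hypothetical and this may differ from the exact implementation in the project.

```python
from typing import Sequence

# Assumption: in the 1000-class ImageNet labelling used by pre-trained
# ResNet50 models, dog breeds occupy the contiguous index range 151-268.
DOG_CLASS_FIRST = 151
DOG_CLASS_LAST = 268


def top_class(probabilities: Sequence[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def looks_like_dog(probabilities: Sequence[float]) -> bool:
    """True when the top ImageNet class falls inside the dog range."""
    return DOG_CLASS_FIRST <= top_class(probabilities) <= DOG_CLASS_LAST
```

Here probabilities would be the output of the ResNet50 model for a single pre-processed image.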

Metrics

To evaluate the performance of the classification model we'll use the Accuracy metric, that is, the proportion of correct predictions among the total number of cases examined. While accuracy is generally not a fair metric when evaluating performance in imbalanced datasets, in our case, random chance presents a very low bar: a random guess would provide a correct answer roughly 1 in 133 times, which corresponds to an accuracy of less than 1%.
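For concreteness, here is a minimal sketch of the accuracy calculation and of the random-guess baseline for 133 classes. The function and variable names are illustrative only.

```python
from typing import Sequence

NUM_BREEDS = 133


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("accuracy is undefined for an empty set of labels")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Expected accuracy of a uniform random guess over all breeds.
random_baseline = 1 / NUM_BREEDS
```

With 133 breeds, random_baseline is roughly 0.0075, which is the "less than 1%" bar mentioned above.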

Data Exploration & Pre-Processing: what data are we using?

To classify the breed we use the dog breed dataset available from Udacity. The dataset covers 133 different dog breeds (including Bichon Frise!) and contains 8351 dog images in total, split into 6680 training images, 835 validation images and 836 test images.

Before feeding the data into any model, we need to pre-process it. When using TensorFlow as a backend, Keras CNNs require a 4D tensor as input with shape (number of samples, rows, columns, channels), so we use the function path_to_tensor to load each image and reshape it into a 4D tensor. We also resize each image to a 224 x 224 pixel square so that it can be consumed by the CNN models. Because we’re working with colour images, each picture has three colour channels, so the final tensor for a single image has the shape (1, 224, 224, 3).
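A minimal sketch of what a function like path_to_tensor might look like is shown below. It assumes Pillow and NumPy are used for loading and reshaping; the project's own version may rely on the Keras image utilities instead, and the float32 dtype is an assumption.

```python
import numpy as np
from PIL import Image

IMAGE_SIZE = (224, 224)


def path_to_tensor(img_path: str) -> np.ndarray:
    """Load an image and return a 4D tensor of shape (1, 224, 224, 3)."""
    # Force three colour channels, then resize to the square input size.
    img = Image.open(img_path).convert("RGB").resize(IMAGE_SIZE)
    # 3D array of shape (224, 224, 3).
    x = np.asarray(img, dtype=np.float32)
    # Add a leading batch axis so a single image becomes a batch of one.
    return np.expand_dims(x, axis=0)
```

Tensors for several images can then be stacked with np.vstack to form a batch of shape (number of samples, 224, 224, 3).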

Why CNNs and Transfer Learning? Modelling approach

Convolutional neural networks (CNNs) are powerful deep learning models widely used for image classification. Compared with other modelling techniques such as multilayer perceptrons (MLPs), CNNs can extract complex patterns from multidimensional data and learn spatially invariant features by applying filters that are learnt during training.

Before trying more advanced methods, we first build a small CNN from scratch to predict dog breeds, using convolution and max-pooling layers followed by a flatten layer, a dropout layer and a couple of dense layers:

In this network, we use 8 layers (convolution and max-pooling) with an increasing number of filters in subsequent steps to allow the model to capture more complex patterns (we start with 32 filters and increase to 128). After extracting spatial features through the convolutional and max-pooling layers, we flatten the output and apply dropout to prevent overfitting (we set the rate to 0.5, so half of the units are randomly dropped at each training step). The result is fed into a dense layer to capture non-linearities and improve the accuracy of the model. Finally, the last layer has an output size of 133, since that’s the number of dog breeds to be learnt.
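The architecture described above could be sketched in Keras roughly as follows. The kernel size, padding, exact filter sequence (32, 64, 128, 128), width of the hidden dense layer, and optimizer are assumptions chosen for illustration; only the overall layer pattern, the 0.5 dropout rate and the 133-way output come from the description.

```python
from tensorflow.keras import layers, models


def build_scratch_cnn(num_classes: int = 133) -> models.Sequential:
    """Small CNN: four conv/max-pool pairs, then flatten, dropout and dense."""
    model = models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        # Randomly zero half of the flattened units during training.
        layers.Dropout(0.5),
        layers.Dense(256, activation="relu"),
        # One output per dog breed.
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="rmsprop",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_scratch_cnn().summary() prints the layer-by-layer output shapes, which is a quick way to check that the final layer has 133 outputs.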

This small net, though, isn't deep enough and was trained only on the small training dataset available for this project (6680 images), so it wasn't surprising that it achieved a low test accuracy (about 13%).

To improve the performance of the breed classification model we leverage transfer learning. The key idea is to import bottleneck features (in our case, from a pre-trained VGG16 model first, and then from a ResNet50 model), freeze the weights of the pre-trained layers and train only the last dense layer to fine-tune the model for a given classification task. Following these steps, the accuracy of the model on the test dataset improved from about 13% to more than 80% (ResNet50 model).
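As a sketch of the trainable part of this setup, the small model below takes pre-computed bottleneck features as input and learns only a final dense layer. The feature shape (7, 7, 2048), the global average pooling layer and the optimizer are assumptions for illustration; the actual shape depends on how the bottleneck features were extracted from the frozen ResNet50 base.

```python
from tensorflow.keras import layers, models


def build_transfer_head(
    feature_shape: tuple = (7, 7, 2048),
    num_classes: int = 133,
) -> models.Sequential:
    """Classifier head trained on bottleneck features from a frozen base."""
    model = models.Sequential([
        layers.Input(shape=feature_shape),
        # Collapse the spatial grid to one value per feature map.
        layers.GlobalAveragePooling2D(),
        # The only layer with trainable weights: one output per breed.
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="rmsprop",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Because the pre-trained layers are not part of this model, training only updates the weights of the dense layer, which is far cheaper than training a deep network from scratch on 6680 images.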

Model Evaluation, Validation & Justification

To understand how different models performed, we take a look at the accuracy and loss over the 20 epochs. As we can see below, the CNN trained from scratch overfit the data: while accuracy improved on the training dataset, it stayed flat on the validation set. Also, after the 5th epoch, the validation loss increased while the training loss decreased: this indicates that the model is unable to generalise well and, therefore, isn't robust enough for this task.

Model evaluation: CNN built from scratch, VGG16, ResNet50

On the other hand, both the VGG16 and ResNet50 models generalised better when leveraging Transfer Learning: validation accuracy improved with each epoch, and both achieved higher accuracy on the test dataset (>40% for the VGG16 model and >80% for the ResNet50 architecture).

Since the ResNet50 model achieved the best accuracy on the test dataset, in this exercise we choose this method to build the dog breed classification algorithm.
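The overfitting pattern described for the CNN trained from scratch (training loss still falling while validation loss rises) can also be checked programmatically from the per-epoch losses. The helper names and the loss values below are made up for illustration; they are not the results of this project.

```python
from typing import Sequence


def best_epoch(val_losses: Sequence[float]) -> int:
    """1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1


def diverges_after(
    train_losses: Sequence[float],
    val_losses: Sequence[float],
    epoch: int,
) -> bool:
    """True if, from `epoch` to the final epoch, training loss went down
    while validation loss went up."""
    train_improved = train_losses[-1] < train_losses[epoch - 1]
    val_worsened = val_losses[-1] > val_losses[epoch - 1]
    return train_improved and val_worsened


# Hypothetical loss curves, not the article's data.
train_loss = [4.8, 4.5, 4.2, 3.9, 3.6, 3.2, 2.8, 2.4]
val_loss = [4.8, 4.6, 4.4, 4.3, 4.2, 4.3, 4.5, 4.8]

turning_point = best_epoch(val_loss)
overfitting = diverges_after(train_loss, val_loss, turning_point)
```

In practice the two lists would come from the training history of each model, and the same check can be used to decide which epoch's weights to keep.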

So what? Is my dog a Bichon Frise then?

Of course it is! But how does the network know? Was it the colour? The eyes? The size?

The reality is that the CNN model can't explain why it predicted a given breed or tell us anything about the key features that influence a given prediction. So while we've got a confirmation of the breed, we haven't been able to learn anything about the specific characteristics we should consider when evaluating a dog breed: in other words, I still haven't got a logical argument to explain why my dog is a Bichon Frise other than "similar dogs in the training dataset were labelled as Bichon Frise".

Conclusions & Caveats

The value of deep neural networks lies in efficiently matching and recognising patterns in multidimensional data, rather than in explaining why a given instance receives a given classification.
Transfer Learning is powerful: deep nets are data hungry and Transfer Learning can significantly improve the performance of the network when the training dataset isn't big enough. In this project, we successfully leveraged Transfer Learning because our dataset had significant overlap with the training set used to pre-train the ResNet50 model. In other contexts, it may be challenging to find an appropriate pre-training set.
How can I explain to others why my dog is a Bichon Frise, and which specific, unique features confirm it? In future posts I'll explore explainable and white-box methods that can give clearer insight into how different features and signals influence a given prediction.

Want to learn more or even try it yourself?

Check out this GitHub repository.