Cassava Leaf Disease Classification: First Steps

Alex Zimbalist
5 min read · Feb 23, 2021

By Brett Cotler and Alex Zimbalist

Note: all the code corresponding to this blog post can be found at https://www.kaggle.com/bcotler/eda-baseline-cassava-data2040-lobster-rolls

Deep learning has revolutionized the ability of computers to classify images. Kaggle, a platform where users post datasets and objectives to crowdsource data science solutions, provides a way to compete and practice with machine learning tools. For our deep learning course in the Brown Data Science Initiative, we are tasked with participating in the Kaggle competition, “Cassava Leaf Disease Classification.” In a nutshell, the task is to create a model that can identify from a picture whether a leaf is healthy and, if not, which of four diseases it is afflicted with. The possible classifications are (0) Cassava Bacterial Blight (CBB), (1) Cassava Brown Streak Disease (CBSD), (2) Cassava Green Mottle (CGM), (3) Cassava Mosaic Disease (CMD), and (4) Healthy. While this post is concerned primarily with preparatory steps, our eventual goal is to use a convolutional neural network (CNN) to accurately classify a leaf’s disease from its picture.

The first steps in a machine learning problem such as this one are to import the relevant data (since this is a Kaggle competition, we do not need to collect it ourselves); to visualize and otherwise seek to understand the data we are working with (this is known as exploratory data analysis, or EDA); and to create a baseline model against which our future (better) models can be compared.

We used the Kaggle API to download the competition data into our notebook. We worked in Google Colab because it provides GPUs for more efficient computing.
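The download step can be sketched as follows. This is a minimal illustration, not the notebook's exact cell: it assumes the `kaggle` CLI is installed (`pip install kaggle`) and that a `kaggle.json` API token has been placed in `~/.kaggle/`; the helper name is ours.

```python
# Sketch of fetching the competition data with the official Kaggle CLI.
# Assumes `pip install kaggle` and an API token in ~/.kaggle/kaggle.json.
import subprocess

COMPETITION = "cassava-leaf-disease-classification"  # real competition slug

def download_command(slug):
    # Build the kaggle CLI call that fetches the competition archive.
    return ["kaggle", "competitions", "download", "-c", slug]

# To actually fetch the data (and then unzip it), run:
#   subprocess.run(download_command(COMPETITION), check=True)
```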

Next, we want to look at the data and get a feel for it. First, we checked whether any of the disease labels for the provided images were missing, as we would want to remove such missing values. Fortunately, all labels were provided. Next, we wanted to get a sense of how many of each label existed in the classification set. For instance, of the 5 possible labels, there might be one or two that are far more common. Indeed, we find that Cassava Mosaic Disease, encoded by the number 3, is far more common than the other four labels.
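These two checks can be sketched with pandas. The real notebook runs them on Kaggle's `train.csv`; here we illustrate on a tiny made-up DataFrame:

```python
# Sketch of the two EDA checks: count missing labels, then tally how
# often each label occurs. The toy DataFrame below stands in for the
# real train.csv, which has the same image_id / label columns.
import pandas as pd

def eda_summary(df):
    n_missing = df["label"].isna().sum()       # how many labels are absent
    counts = df["label"].value_counts()        # frequency of each class
    return n_missing, counts

toy = pd.DataFrame({"image_id": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
                    "label": [3, 3, 0, 3]})
n_missing, counts = eda_summary(toy)
print(n_missing)        # 0 -> no labels are missing
print(counts.idxmax())  # 3 -> CMD dominates, as in the real data
```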

Classification label 3 is by far the most common

Additionally, we want to get a general feel for what each disease looks like. This is important for building intuition about the task at hand.

From top to bottom, we have classification labels 0, 1, 2, 3, and 4

At the very least, we should hope that our neural network (deep learning) model does better than a model that simply guesses the most common label, label 3, for every image. Therefore, our baseline model is quite simple: always predict label 3. It turns out that this baseline model, due to the frequency of label 3 among the disease labels, has an accuracy of just over 61%.
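The majority-class baseline is a one-liner. Illustrated here on made-up labels; on the real label column this comes out to just over 61%:

```python
# Majority-class baseline: always predict the most common label.
# Its accuracy is simply the fraction of examples in that class.
import pandas as pd

def baseline_accuracy(labels):
    # value_counts(normalize=True) gives class fractions; the max is
    # the accuracy of always guessing the most frequent class.
    return labels.value_counts(normalize=True).max()

toy_labels = pd.Series([3, 3, 3, 0, 1])   # stand-in for the real column
print(baseline_accuracy(toy_labels))      # 0.6
```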

Next, we created an untuned, first-attempt convolutional neural network for the classification problem. As a first step, we used Keras’ ImageDataGenerator to split the images into training and validation sets and to preprocess them so they are ready to be used for training.

Preprocessing the images so that the CNN can use them to learn
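A minimal sketch of this preprocessing step, assuming TensorFlow/Keras. The split fraction, image size, batch size, and directory path below are illustrative assumptions, not the notebook's exact values:

```python
# Sketch of the ImageDataGenerator setup: rescale pixels to [0, 1] and
# hold out a slice of the images for validation.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def make_generator():
    # validation_split=0.2 is our assumption for the holdout fraction.
    return ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

def make_flows(train_df, image_dir="train_images/"):
    # Pairs each image_id with its label; labels should be strings for
    # class_mode="categorical". Paths and sizes here are assumptions.
    gen = make_generator()
    kwargs = dict(dataframe=train_df, directory=image_dir,
                  x_col="image_id", y_col="label",
                  target_size=(224, 224), class_mode="categorical",
                  batch_size=32)
    return (gen.flow_from_dataframe(subset="training", **kwargs),
            gen.flow_from_dataframe(subset="validation", **kwargs))
```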

After ImageDataGenerator worked its magic, we created a sequential model with several convolutional layers followed by a few dense layers, with fewer neurons in each subsequent dense layer. For all layers except the final one, we use ReLU activation, as this generally performs well and mitigates the risk of exploding or vanishing gradients, a common problem plaguing neural networks. Our final layer has 5 neurons and uses softmax activation, so that the model outputs a predicted probability for each of the 5 classifications.

We also take several precautions against overfitting. We added a couple of dropout layers, which randomly zero out a fraction of activations during training; this will hopefully help prevent the network from memorizing the training set. As another guard against overfitting, we add a regularization term to some layers in our network, which penalizes large weights. For the convolutional layers, we use “same” zero-padding, which keeps each layer’s spatial output size equal to its input size. Additionally, we add a couple of max-pooling layers, which downsample the feature maps and keep only the strongest activation in each region, shrinking the image representation in a way that might help with classification. Finally, when we fit the model on the training images, we add an early-stopping callback, which may terminate the training process early, once again reducing the risk of overfitting. The model architecture is shown below:

Preliminary CNN Architecture
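For readers who prefer code to a diagram, here is an illustrative Keras sketch of a network in this style. The filter counts, dense-layer sizes, and dropout/regularization values are our stand-ins, not the exact architecture from the notebook:

```python
# Illustrative Keras version of the kind of network described above.
from tensorflow.keras import layers, models, regularizers

def build_model(input_shape=(224, 224, 3), n_classes=5):
    l2 = regularizers.l2(1e-4)  # penalizes large weights
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # "same" zero-padding keeps spatial output size equal to input size
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),   # downsample, keep strongest activations
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu", kernel_regularizer=l2),
        layers.Dropout(0.3),     # randomly zero activations during training
        layers.Dense(64, activation="relu", kernel_regularizer=l2),
        layers.Dropout(0.3),
        # 5 neurons + softmax -> one probability per class
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Fitting would then attach the early-stopping callback, e.g. `model.fit(..., epochs=20, callbacks=[EarlyStopping(patience=5, restore_best_weights=True)])`, matching the patience=5 run shown below.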

We realize that this explanation really skims over our reasoning for why we include various layers or parameters. This is largely because we didn’t put too much thought into this very preliminary CNN! In our next post, we will further explain these aspects of the model architecture as we begin to tweak the model to hopefully achieve drastically better results.

Below are the results from training our preliminary CNN on the training images and making predictions on the validation set:

Preliminary CNN trained with 20 epochs (early stopping with patience=5 was not triggered)

We can better understand the model training history with a couple of simple visualizations:

The training and testing loss barely improve after about 3 epochs
The model accuracy likewise fails to make dramatic improvements after about 3 epochs
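Plots like these can be produced from the dictionary that Keras’ `model.fit(...)` returns in its `History.history` attribute. The helper below is ours, not the notebook's exact plotting code:

```python
# Sketch of the training-history plots: loss on the left, accuracy on
# the right, with training and validation curves overlaid.
import matplotlib
matplotlib.use("Agg")  # render off-screen (e.g. when run as a script)
import matplotlib.pyplot as plt

def plot_history(history):
    # history is the dict from model.fit(...).history
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    for key in ("loss", "val_loss"):
        ax_loss.plot(history[key], label=key)
    for key in ("accuracy", "val_accuracy"):
        ax_acc.plot(history[key], label=key)
    ax_loss.set(title="Loss", xlabel="epoch")
    ax_acc.set(title="Accuracy", xlabel="epoch")
    ax_loss.legend(); ax_acc.legend()
    return fig
```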

So far, our CNN (61.7% validation accuracy) performs almost exactly the same as our baseline model (61.5% accuracy), which labeled every image 3, Cassava Mosaic Disease. Clearly, there is room for improvement! Stay tuned for our next blog post :)

Next Steps

  • Data augmentation
  • CNN Parameter / Hyperparameter tuning and experimentation
  • Solving other issues that will inevitably arise
