PlantVillage has gathered a collection of more than 54,000 images over time. Pictures were shot by different people, using different cameras with different automatic adjustments, under varying lighting conditions. Some are already segmented. Here are a few examples:
Technology, particularly computer vision, now makes it possible to "teach" a computer to recognise features in images. If one has enough images with known features, an "educated machine" can achieve a level of performance in recognition that competes with that of trained humans.
One of our dreams at PlantVillage was that our images would be used to train an algorithm which, in turn, could be distributed around the world to diagnose plant diseases with a smartphone.
So-called deep learning with neural networks is the tool of choice to train a computer to recognise features. The idea is to have the neural network ingest an image and then tell it what it has just seen. Repeat that thousands or millions of times and, little by little, it "learns". Image-based diagnosis is mainly about visual features (as opposed to context, for instance), so our dream could become reality.
One big concern about teaching a machine is that you should make sure it learns what you want and not something else. If you want to teach a computer to tell apples from bananas, and all your images of apples are black and white whereas your images of bananas are in color, it might simply reach the obvious conclusion that all black-and-white pictures in the world are pictures of apples.
Of course, at PlantVillage, our images are all in color. But, amongst all the images, one could easily spot features that are specific to one image set or another. For instance, most of our images of healthy apple leaves could be bluish, or most of our tomato early blight images could be on a light background with strong and sharp shadows. We, as humans, easily spot these features as irrelevant; it's not as obvious for a computer.
So came the idea to process our images so as to remove the background and keep the leaf only. This process is called segmentation.
Our approach was empirical. We took an image and tried to find a way to separate the leaf from the background. Then we adjusted the technique so that it worked with more and more images, until we reached a near-perfect or at least decent result with almost all of them.
The steps are:
If you take a picture of a uniform background under non-uniform lighting conditions, you'll end up with a gradient in all R, G and B channels (in the RGB color space), whereas only the L channel will be affected (in the Lab color space).
[IMAGES]
We convert the original RGB image into the Lab color space:
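For readers who want to experiment, here is how this conversion might look with OpenCV in Python; the library choice and the file name are ours, not part of the original pipeline. Note that OpenCV loads images in BGR order and encodes L, a and b as 8-bit values, with 128 as the neutral chroma point.

```python
import cv2

# Load a picture and convert it to the Lab color space.
# "leaf.jpg" is only an example file name.
bgr = cv2.imread("leaf.jpg")
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
L, a, b = cv2.split(lab)  # lightness, green-red axis, blue-yellow axis
```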
The contrast is automatically adjusted. Then, slight lightness gradients are partly removed with band masking in the frequency space
via Fourier transform.
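This step is the hardest to reproduce without the original code, so the following is only a rough sketch of the idea, assuming OpenCV and NumPy: stretch the contrast, then attenuate the lowest non-zero spatial frequencies of the L channel so a smooth lighting gradient fades out while the overall brightness (the DC term) is preserved. The cutoff radius and attenuation factor are arbitrary.

```python
import cv2
import numpy as np

def flatten_lightness(L, cutoff=3, attenuation=0.1):
    """Sketch: contrast stretch, then damp very low spatial frequencies."""
    # Stretch the contrast to the full 0-255 range.
    L = cv2.normalize(L, None, 0, 255, cv2.NORM_MINMAX)
    # Move to the frequency domain, DC term at the centre.
    f = np.fft.fftshift(np.fft.fft2(L.astype(np.float32)))
    h, w = L.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - cy, xx - cx)
    # Band mask: keep the DC term but attenuate the lowest frequencies,
    # which encode slow gradients across the image.
    band = (r > 0) & (r < cutoff)
    f[band] *= attenuation
    out = np.real(np.fft.ifft2(np.fft.ifftshift(f)))
    return np.clip(out, 0, 255).astype(np.uint8)
```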
Channels L, a and b are slightly blurred to decrease noise:
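A minimal sketch of this denoising step, again with OpenCV; the kernel size is our assumption, not the original setting.

```python
import cv2

def denoise_lab(lab, ksize=3):
    """Lightly blur each Lab channel to reduce noise."""
    L, a, b = cv2.split(lab)
    L = cv2.GaussianBlur(L, (ksize, ksize), 0)
    a = cv2.GaussianBlur(a, (ksize, ksize), 0)
    b = cv2.GaussianBlur(b, (ksize, ksize), 0)
    return cv2.merge([L, a, b])
```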
We expect at least one of the borders of the image to show the background. We thus extract the four few-pixel-wide edges:
The median level of each edge image is calculated for the a and b channels.
The border median color that is farthest away from green is considered to be the most likely to be that of the background and will be the starting point for the background detection.
If the color is too close to green, we consider that no background was found and skip the rest of the processing.
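Here is a sketch of how the border analysis and the green test might be written, assuming OpenCV. The strip width, the way "greenness" is measured (via the a channel, where green means low values in 8-bit Lab) and the cut-off value are our assumptions, not the original tuning.

```python
import cv2
import numpy as np

def find_background_chroma(lab, strip=5, green_cutoff=120):
    """Estimate the background chroma (a, b) from the image borders.

    Takes the four strip-pixel-wide edges, computes the median a/b of each,
    and keeps the edge whose chroma is farthest from green (here simply the
    largest median a). Returns None when even that edge still looks green,
    i.e. no background was found.
    """
    _, a, b = cv2.split(lab)
    edges = [
        (a[:strip, :], b[:strip, :]),    # top
        (a[-strip:, :], b[-strip:, :]),  # bottom
        (a[:, :strip], b[:, :strip]),    # left
        (a[:, -strip:], b[:, -strip:]),  # right
    ]
    medians = [(float(np.median(ea)), float(np.median(eb))) for ea, eb in edges]
    bg_a, bg_b = max(medians, key=lambda m: m[0])  # least green edge
    if bg_a < green_cutoff:
        return None  # still clearly green: skip the rest of the processing
    return bg_a, bg_b
```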
Since we know that our pictures were taken over a light-gray background, the values of a and b for all the pixels are shifted so that the median values found in the previous step become those of gray (0.5).
By correcting the a and b channels, we make sure that the background is the right color, and the leaf recovers its natural color.
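A possible implementation of this chroma shift, reusing the background medians `bg_a` and `bg_b` from the sketch above. In OpenCV's 8-bit Lab encoding the neutral point is 128, which corresponds to the 0.5 mentioned above on a 0-1 scale.

```python
import cv2
import numpy as np

def recenter_background(lab, bg_a, bg_b):
    """Shift a and b so the detected background chroma lands on neutral gray."""
    L, a, b = cv2.split(lab)
    a = np.clip(a.astype(np.float32) + (128 - bg_a), 0, 255).astype(np.uint8)
    b = np.clip(b.astype(np.float32) + (128 - bg_b), 0, 255).astype(np.uint8)
    return cv2.merge([L, a, b])
```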
This part is where most of the empirical tweaking went. A mask is an image with all pixels
set to white or black, corresponding to a binary choice (good/bad, keep/don't keep, etc.).
After several processing steps, we end up with the following masks:
This mask is generated by comparing the value of each pixel, in the a and b channels, to the corresponding value of the background. A threshold is applied to keep only the pixels that are supposedly not part of the background.
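A sketch of such a color mask, reusing the background chroma found earlier; the threshold value is purely illustrative.

```python
import cv2
import numpy as np

def color_mask(lab, bg_a, bg_b, thresh=10):
    """White (255) where a pixel's a/b chroma differs from the background
    chroma by more than `thresh` (8-bit Lab values)."""
    _, a, b = cv2.split(lab)
    dist = np.hypot(a.astype(np.float32) - bg_a,
                    b.astype(np.float32) - bg_b)
    return np.where(dist > thresh, 255, 0).astype(np.uint8)
```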
The shadow mask tries to keep only pixels that are not too dark and whose color is not too far from that of the background.
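The exact rule is not spelled out here, so the sketch below is only one possible reading: a pixel is flagged as shadow when it is dark while its chroma stays close to the background's, and the mask keeps everything else. Both thresholds are illustrative.

```python
import cv2
import numpy as np

def shadow_keep_mask(lab, bg_a, bg_b, l_thresh=90, chroma_thresh=12):
    """White (255) for pixels we keep, 0 for pixels treated as shadow."""
    L, a, b = cv2.split(lab)
    chroma_dist = np.hypot(a.astype(np.float32) - bg_a,
                           b.astype(np.float32) - bg_b)
    is_shadow = (L < l_thresh) & (chroma_dist < chroma_thresh)
    return np.where(is_shadow, 0, 255).astype(np.uint8)
```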
The color and shadow masks are combined so that the shadow areas are removed from the color mask.
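Assuming both masks use 255 for "keep", the combination can be sketched as a pixel-wise AND: a pixel survives only when the color mask flags it as non-background and the shadow mask does not flag it as shadow.

```python
import cv2

def combine_masks(color_mask, shadow_keep_mask):
    """Remove shadow areas from the color mask (both masks: 255 = keep)."""
    return cv2.bitwise_and(color_mask, shadow_keep_mask)
```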
The mask is processed to remove isolated white or black pixels, through erosion and dilation.
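In OpenCV terms this corresponds to a morphological opening (erosion then dilation) followed by a closing; the structuring-element size below is an assumption.

```python
import cv2

def clean_mask(mask, ksize=5):
    """Remove isolated white specks (opening) and small black holes (closing)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```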
A flood fill is then applied from the border to select the whole background and the mask is slightly blurred.
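A sketch of this final cleanup, assuming OpenCV. Padding the mask with a one-pixel background frame lets a single seed reach every background region that touches the border; the blur amount is our choice.

```python
import cv2
import numpy as np

def fill_from_border(mask, blur_sigma=2.0):
    """Flood-fill the background from the border, then soften the mask edge.

    `mask` is uint8, 255 = leaf candidate, 0 = background candidate.
    """
    # One-pixel background frame so one seed reaches all border-connected background.
    padded = cv2.copyMakeBorder(mask, 1, 1, 1, 1, cv2.BORDER_CONSTANT, value=0)
    ff_mask = np.zeros((padded.shape[0] + 2, padded.shape[1] + 2), np.uint8)
    cv2.floodFill(padded, ff_mask, (0, 0), 128)  # 128 marks the background
    filled = padded[1:-1, 1:-1]
    # Everything not reached from the border (including holes inside the leaf)
    # is kept as foreground.
    result = np.where(filled == 128, 0, 255).astype(np.uint8)
    # Slight blur gives a soft transition between leaf and background.
    return cv2.GaussianBlur(result, (0, 0), blur_sigma)
```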
The reconstructed, color-corrected image is now masked, i.e. the background is turned black.
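A sketch of this final compositing step; multiplying by the (slightly blurred) mask rather than cutting hard gives a soft transition at the leaf edge.

```python
import cv2
import numpy as np

def apply_mask(bgr_corrected, mask):
    """Keep a pixel of the color-corrected image only where the final mask
    is (close to) white; elsewhere the pixel fades to black."""
    mask_3ch = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR).astype(np.float32) / 255.0
    return (bgr_corrected.astype(np.float32) * mask_3ch).astype(np.uint8)
```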