Earlier we explored image segmentation, looking into exactly what the term means, before diving into different types of models that can be used for image segmentation. Now we're going to go deeper into the model on which all of the others are based: the fully convolutional neural network. The architecture diagram from the original paper describing fully convolutional networks, or FCNs, is shown here. The model learns the key features of the image using a CNN feature extractor, which is considered the encoder part of the model. As the image passes through the convolutional layers, it gets downsampled. Then the output is passed to the decoder section of the model, which consists of additional convolutional layers. The decoder layers upsample the image step by step back to its original dimensions, so that we get a pixelwise labeling, also called a pixel mask or segmentation mask, of the original image.

The encoder can use the convolutional layers of a traditional CNN architecture. Note that the fully connected layers of these traditional CNN models are used for classification, so the encoder of the image segmentation model won't reuse those fully connected layers. Common architectures that it can reuse are VGG-16, ResNet-50, and MobileNet, but of course, you can design and use your own. What allows you to take the CNN from the encoder and turn it into an architecture that gives you image segmentation is the decoder. Popular decoders that we'll look at in detail are FCN-32, FCN-16, and FCN-8; their outputs are shown in the original paper right here. Let's look at these in detail, and we'll start with FCN-32.

As a quick review that might help you understand the decoder, let's recall what happens with a pooling layer. As an example, I'm going to start with a tiny image here that has eight pixels, arranged in two columns and four rows. If you perform pooling with a window size of 2 by 2, such as average pooling, the first application of the pooling window covers the top four cells of the image and pools those four values into a single value. If you choose a stride of 2 by 2, the pooling window then slides two cells down. The pooling is then applied to the bottom four cells of the image, pooling them into a single value that you can see here. Notice that the input image has four rows, but the pooling result has two rows. Also notice that the input image had two columns, but the pooling result has one column. If you have a pooling layer with a 2 by 2 pooling window and a stride of 2 by 2, the result of your pooling will reduce the height and width by half.

Now let's look at the FCN-32 decoder architecture. Recall, like we just said, that when you pool an image with a 2 by 2 window size and a stride of 2 by 2, you reduce the image by half along each axis, so a 256 by 256 image would get pooled to 128 by 128, and so on. The architecture has five pooling layers, so each pooled result gets its dimensions reduced by half, five times. The original image gets reduced by a factor of 2^5, or 32. If the output of the final pooling layer, which we're calling pool 5, is upsampled back to the original image size, it needs to be upsampled by a factor of 32. This is done by upsampling with a stride size of 32, which means that each input pixel from pool 5 is turned into a 32 by 32 pixel output. This 32x upsampled output is also the pixelwise prediction of classes for the original image. That's what the FCN-32 decoder does, and that's where it gets its name.
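To make that pooling arithmetic concrete, here's a minimal sketch that runs the tiny four-row, two-column example through an average pooling layer. I'm using TensorFlow/Keras here as an assumption on my part; the halving behavior is the same in any framework.

```python
import tensorflow as tf

# The tiny example from above: 8 pixels in 4 rows and 2 columns,
# shaped as (batch, height, width, channels) for Keras.
image = tf.reshape(tf.range(8, dtype=tf.float32), (1, 4, 2, 1))

# A 2x2 average pooling window with a 2x2 stride halves height and width.
pooled = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)(image)

print(image.shape)                 # (1, 4, 2, 1) -> four rows, two columns
print(pooled.shape)                # (1, 2, 1, 1) -> two rows, one column
print(tf.squeeze(pooled).numpy())  # [1.5 5.5]: each 2x2 block averaged
```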
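And here's a minimal sketch of what an FCN-32 head on top of a VGG-16 encoder could look like in Keras. The choice of encoder, the 21-class output, and the transposed-convolution kernel size are all illustrative assumptions, not the paper's exact setup (the paper, for instance, initializes its upsampling layers to perform bilinear interpolation).

```python
import tensorflow as tf

n_classes = 21  # illustrative; e.g. PASCAL VOC has 21 classes

# VGG-16 convolutional layers as the encoder; include_top=False drops
# the fully connected classification layers, as discussed above.
inputs = tf.keras.Input(shape=(224, 224, 3))
encoder = tf.keras.applications.VGG16(include_top=False, input_tensor=inputs)
pool5 = encoder.output  # (7, 7, 512): 224 reduced by 2^5 = 32

# A 1x1 convolution turns the 512 feature channels into per-class scores.
scores = tf.keras.layers.Conv2D(n_classes, 1)(pool5)

# One transposed convolution with stride 32 turns each score pixel into
# a 32x32 patch, restoring the original 224x224 size in a single step.
mask = tf.keras.layers.Conv2DTranspose(
    n_classes, kernel_size=64, strides=32, padding="same",
    activation="softmax")(scores)

model = tf.keras.Model(inputs, mask)  # output: (224, 224, n_classes)
```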
FCN-16 works similarly to FCN-32, but in addition to using pool 5, it also uses pool 4. In step 1, the output of pool 5 is upsampled by a factor of 2, so the result has the same height and width as pool 4. Separately, we use the output of pool 4 to make a pixelwise prediction using a 1 by 1 convolutional layer. But don't worry about the details of that 1 by 1 convolutional layer yet; we'll look into that a little later. The pool 4 prediction is added to the 2x upsampled output of pool 5. The output of this addition is then upsampled by a factor of 16 to get the final pixelwise segmentation map. Upsampling with a stride of 16 takes each input pixel and outputs a 16 by 16 grid of pixels, so this decoder type is named FCN-16.

The FCN-8 decoder works very similarly, with the same first two steps. But instead of upsampling the summation of the pool 4 and pool 5 predictions by 16, it 2x upsamples it and then adds that to the pool 3 prediction. This is then upsampled by 8, and hence the decoder is named FCN-8.

Going back to this image, we can see the impact of this: by factoring in the results from pooling layers earlier in the architecture, when the image is at a higher resolution, our segments are better defined. Thus, the FCN-8 result looks better than the FCN-16, which in turn looks better than the FCN-32. Of course, depending on your scenario, the FCN-32 might be good enough, and it might not be worth the extra processing required to do FCN-16 or FCN-8.
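Here's a sketch of how those FCN-16 and FCN-8 skip connections can be wired up in Keras, continuing the assumed VGG-16 encoder from before. The block3_pool, block4_pool, and block5_pool names are Keras's layer names for VGG-16; the kernel sizes and class count are illustrative choices, not prescribed by the paper.

```python
import tensorflow as tf

n_classes = 21
inputs = tf.keras.Input(shape=(224, 224, 3))
vgg = tf.keras.applications.VGG16(include_top=False, input_tensor=inputs)

# Grab the intermediate pooling outputs by their Keras layer names.
pool3 = vgg.get_layer("block3_pool").output  # (28, 28, 256)
pool4 = vgg.get_layer("block4_pool").output  # (14, 14, 512)
pool5 = vgg.get_layer("block5_pool").output  # (7, 7, 512)

# Step 1: per-class scores from pool 5, 2x upsampled to pool 4's size.
# (The 1x1 convolution makes the channel counts match for the addition.)
score5 = tf.keras.layers.Conv2D(n_classes, 1)(pool5)
up5 = tf.keras.layers.Conv2DTranspose(
    n_classes, 4, strides=2, padding="same")(score5)   # (14, 14)

# Step 2: pixelwise prediction from pool 4 via a 1x1 convolution, added
# to the 2x upsampled pool 5 scores.
score4 = tf.keras.layers.Conv2D(n_classes, 1)(pool4)
fuse4 = tf.keras.layers.Add()([up5, score4])

# FCN-16: upsample the fused scores by 16, back to the input size.
fcn16_mask = tf.keras.layers.Conv2DTranspose(
    n_classes, 32, strides=16, padding="same",
    activation="softmax")(fuse4)                       # (224, 224)

# FCN-8: instead, 2x upsample the fusion, add the pool 3 prediction,
# then upsample that result by 8.
up4 = tf.keras.layers.Conv2DTranspose(
    n_classes, 4, strides=2, padding="same")(fuse4)    # (28, 28)
score3 = tf.keras.layers.Conv2D(n_classes, 1)(pool3)
fuse3 = tf.keras.layers.Add()([up4, score3])
fcn8_mask = tf.keras.layers.Conv2DTranspose(
    n_classes, 16, strides=8, padding="same",
    activation="softmax")(fuse3)                       # (224, 224)

model = tf.keras.Model(inputs, fcn8_mask)
```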