
I'm trying to use Caffe for a simple semantic image segmentation task (i.e. classifying each pixel in an image as belonging to one of 2 classes).

I am stuck with two issues: 1) Data preparation, and 2) network layer definition.

I have tried to read through some existing examples, and although the links are useful, they don't specifically apply to semantic segmentation with 2D images.

I would greatly appreciate (even brief) code examples covering the following as a complete pipeline:

  1. Prepare the image label (2D array) in the correct format. An example using MemoryData or HDF5 as input would be perfect!
  2. Define the network prototxt correctly to input the data and the above label.

Thanks!

1 Answer


While there isn't an official tutorial on this in the Caffe master branch yet, there are quite a few tutorials on doing semantic segmentation with Caffe. For starters, you should look into the tutorials for the Fully Convolutional Networks master branch, as well as the tutorial on using SegNet (GitHub here) or DeepLab. These are all state-of-the-art methods that use Caffe for semantic segmentation.

To answer your question more directly,

1) Data preparation: As someone who has shown interest in recent deep learning approaches, you have probably found that there is no single way to do data preparation. It depends both on what is mathematically possible (networks with fully connected layers at the end require inputs of a fixed size, and usually a fixed aspect ratio) and on what improves performance (e.g. mean subtraction). That said, a few techniques are commonplace. (For simplicity's sake, I will assume from this point on that you can use images of varying sizes, as with Fully Convolutional Networks; if you want to see how cropping works, there is a good explanation of that kind of data preparation in Caffe's ImageNet tutorial.) Using the Transformer class, most people do the following:

transformer.set_transpose('data', (2,0,1))  # move image channels to outermost dimension
transformer.set_mean('data', mu)            # subtract the dataset-mean value in each channel
transformer.set_raw_scale('data', 255)      # rescale from [0, 1] to [0, 255]
transformer.set_channel_swap('data', (2,1,0))  # swap channels from RGB to BGR
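
For completeness, here is a minimal sketch (the net, weights, mean file, and image path are placeholder names) of how that transformer is typically constructed and applied before filling the input blob:

import numpy as np
import caffe

# placeholder files: deploy.prototxt, weights.caffemodel, mean.npy, example.png
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)
mu = np.load('mean.npy').mean(1).mean(1)    # per-channel mean of the training set

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', mu)
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

image = caffe.io.load_image('example.png')  # H x W x C, floats in [0, 1]
net.blobs['data'].data[...] = transformer.preprocess('data', image)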

In the context of segmentation, that is all you need to do for the input images. The semantic labels are usually images themselves. For example, in the Pascal VOC Caffe example, you read in the labels with a Python data layer:

n.data, n.label = L.Python(module='pascal_multilabel_datalayers', layer=datalayer, ntop=2, param_str=str(data_layer_params))
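
Since you asked about HDF5 specifically, here is a minimal sketch of preparing a 2D image and its per-pixel label for the HDF5Data layer; the shapes, file names, and random placeholder arrays are just for illustration, and the dataset names must match the layer's top blob names:

import h5py
import numpy as np

# placeholder data: one 3-channel 256x256 image and a binary per-pixel label map;
# the HDF5Data layer expects float32 arrays in N x C x H x W order
data = np.random.rand(1, 3, 256, 256).astype(np.float32)
label = np.random.randint(0, 2, size=(1, 1, 256, 256)).astype(np.float32)

with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=data)     # matches the 'data' top blob
    f.create_dataset('label', data=label)   # matches the 'label' top blob

# the HDF5Data layer is pointed at a text file listing the .h5 files, one per line
with open('train_h5_list.txt', 'w') as f:
    f.write('train.h5\n')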

**2) Network layer definition**

For the network layer definition, remember that one of the brilliant things about neural networks is that, aside from the inputs and outputs, they can handle a wide variety of data. As such, all of your intermediate layers stay the same, and in your case so does the input. What you need at the end is something with which to evaluate the cross-entropy loss against the label image. For DeepLab, the authors wrote an "Interp" layer that does this. SegNet, on the other hand, wrote an "Upsample" layer, which they use before the softmax to make the network output the same size as the label, and then simply apply a softmax loss.
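
To make that concrete, here is a toy NetSpec sketch of a two-class fully convolutional net fed from the HDF5 list file above; the layer names and sizes are placeholders, and it deliberately keeps the spatial resolution unchanged (no pooling) so that no upsampling layer is needed before the loss:

import caffe
from caffe import layers as L

n = caffe.NetSpec()
# data: N x 3 x H x W images, label: N x 1 x H x W per-pixel class indices
n.data, n.label = L.HDF5Data(batch_size=1, source='train_h5_list.txt', ntop=2)
n.conv1 = L.Convolution(n.data, num_output=16, kernel_size=3, pad=1,
                        weight_filler=dict(type='xavier'))
n.relu1 = L.ReLU(n.conv1, in_place=True)
n.score = L.Convolution(n.relu1, num_output=2, kernel_size=1,   # 2 classes
                        weight_filler=dict(type='xavier'))
# per-pixel cross-entropy between the 2-channel score map and the label map
n.loss = L.SoftmaxWithLoss(n.score, n.label)

with open('train.prototxt', 'w') as f:
    f.write(str(n.to_proto()))

A real network would downsample and then upsample again (with "Interp", "Upsample", or a Deconvolution layer) before the loss, but the data/label wiring stays the same.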

I guess the somewhat disappointing takeaway from all of this is that there isn't one clear-cut way to do this in Caffe yet, but the good news is that there are plenty of examples of it done successfully. Hope this helps!