I have approached a similar problem previously in Python. Hopefully this is helpful and you can come up with an alternative implementation in Matlab if that is what you are using.
After all was said and done, I landed on a single model for all predictions. For your purpose you could have one binary output for dog vs. cat, another multi-class output for the dog breeds, and another multi-class output for the cat breeds.
Using Tensorflow, I created a mask for the irrelevant classes. For example, if the image was of a cat, then all of the dog breeds are irrelevant and they should not impact model training for that example. This required a customized TF Dataset (that converted 0's to -1 for the mask) and a customized loss function that returned 0 error when the mask was present for that example.
Finally for the training process. Specific to your question, you will have to create custom accuracy functions that can handle the mask values how you want them to, but otherwise this part of the process should be standard. It was best practice to evenly spread out the classes among the training data but they can all be trained together.
If you google "Multi-Task Training" you can find additional resources for this problem.
Here are some code snips if you are interested:
For the customize TF dataset that masked irrelevant labels...
# Replace 0's with -1 for mask when there aren't any labels
def produce_mask(features):
for filt, tensor in features.items():
if "target" in filt:
condition = tf.equal(tf.math.reduce_sum(tensor), 0)
features[filt] = tf.where(condition, tf.ones_like(tensor) * -1, tensor)
return features
def create_dataset(filepath, batch_size=10):
...
# **** This is where the mask was applied to the dataset
dataset = dataset.map(produce_mask, num_parallel_calls=cpu_count())
...
return parsed_features
Custom loss function. I was using binary-crossentropy because my problem was multi-label. You will likely want to adapt this to categorical-crossentropy.
# Custom loss function
def masked_binary_crossentropy(y_true, y_pred):
mask = backend.cast(backend.not_equal(y_true, -1), backend.floatx())
return backend.binary_crossentropy(y_true * mask, y_pred * mask)
Then for the custom accuracy metrics. I was using top-k accuracy, you may need to modify for your purposes, but this will give you the general idea. When comparing this to the loss function, instead of converting all to 0, which would over-inflate the accuracy, this function filters those values out entirely. That works because the outputs are measured individually, so each output (binary, cat breed, dog breed) would have a different accuracy measure filtered only to the relevant examples.
backend
is keras backend.
def top_5_acc(y_true, y_pred, k=5):
mask = backend.cast(backend.not_equal(y_true, -1), tf.bool)
mask = tf.math.reduce_any(mask, axis=1)
masked_true = tf.boolean_mask(y_true, mask)
masked_pred = tf.boolean_mask(y_pred, mask)
return top_k_categorical_accuracy(masked_true, masked_pred, k)
Edit
No, in the scenario I described above there is only one model and it is trained with all of the data together. There are 3 outputs to the single model. The mask is a major part of this as it allows the network to only adjust weights that are relevant to the example. If the image was a cat, then the dog breed prediction does not result in loss.