Video Activity Recognition Exercise: CNN + LSTM
(NOTE: an alternative exercise uses image captioning rather than video activity recognition https://www.freecodecamp.org/news/building-an-image-caption-generator-with-deep-learning-in-tensorflow-a142722e9b1f/)
- Using Keras (which can sit on top of TensorFlow) https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/
- Overview and the computational challenges (so many parameters = LONG training times and running out of memory -- see below too for why Method 4 works best) http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
Do Method 4 (the other options can run out of memory and take too long to train) from https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5
Method #4: Extract features with a CNN, pass the sequence to a separate RNN
Given how well Inception did at classifying our images, why don’t we try to leverage that learning? In this method, we’ll use the Inception network to extract features from our videos, and then pass those to a separate RNN.
This takes a few steps.
First, we run every frame from every video through Inception, saving the output from the final pool layer of the network. So we effectively chop off the top classification part of the network so that we end up with a 2,048-d vector of features that we can pass to our RNN. For more info on this strategy, see my previous blog post on continuous video classification.
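A minimal sketch of that first step, assuming Keras's pre-trained InceptionV3 (the helper name, file-based frames, and input size are illustrative, not from the original post): dropping the classification head and taking the final average-pooling output gives the 2,048-d feature vector per frame.

```python
# Sketch: per-frame feature extraction with a pre-trained InceptionV3 in Keras.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False removes the classification head; pooling='avg' exposes the
# final pooling layer, so each frame maps to a 2,048-d feature vector.
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_frame_features(frame_path):
    """Return a (2048,) feature vector for one frame saved as an image file."""
    img = image.load_img(frame_path, target_size=(299, 299))  # Inception's input size
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return extractor.predict(x)[0]
```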
Second, we convert those extracted features into sequences of extracted features. If you recall from our constraints, we want to turn each video into a 40-frame sequence. So we stitch the sampled 40 frames together, save that to disk, and now we’re ready to train different RNN models without needing to continuously pass our images through the CNN every time we read the same sample or train a new network architecture.
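A rough sketch of that second step, reusing the `extract_frame_features` helper from the sketch above (the function name and on-disk layout are assumptions): stack the 40 sampled frames' features into one (40, 2048) array and cache it with NumPy, so training never has to run the CNN again.

```python
import numpy as np

SEQ_LENGTH = 40  # per the 40-frames-per-video constraint

def build_sequence(frame_paths, out_path):
    """Turn 40 sampled frame image paths into one cached (40, 2048) feature array."""
    assert len(frame_paths) == SEQ_LENGTH
    features = [extract_frame_features(p) for p in frame_paths]  # each (2048,)
    np.save(out_path, np.stack(features))  # later: np.load(out_path) -> (40, 2048)
```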
For the RNN, we use a single, 4096-wide LSTM layer, followed by a 1024 Dense layer, with some dropout in between. This relatively shallow network outperformed all variants where I tried multiple stacked LSTMs. Let’s take a look at the results:
Hey that’s pretty good! Our first temporally-aware network that achieves better than CNN-only results. And it does so by a significant margin.
Final test accuracy: 74% top 1, 91% top 5
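As a rough sketch (not the post's exact code), the RNN described above could look like this in Keras: one 4096-wide LSTM over the (40, 2048) feature sequences, some dropout, a 1024-unit Dense layer, then a softmax classifier. The dropout rate, optimizer, and the UCF101-style class count are assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

NUM_CLASSES = 101  # assumption: UCF101-style class count; substitute your own

model = Sequential([
    LSTM(4096, input_shape=(40, 2048)),   # one wide LSTM over the feature sequence
    Dropout(0.5),                         # "some dropout in between" (rate assumed)
    Dense(1024, activation="relu"),
    Dropout(0.5),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", "top_k_categorical_accuracy"])  # top-1 and top-5
```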