Exercise: Video Activity Recognition using (CNN + LSTM)
Modified from "Hands-On Computer Vision with TensorFlow 2" by B. Planche and E. Andres.
You are going to work through and implement Video Activity Recognition using a CNN for feature extraction, followed by an LSTM
and a final decision layer.
Uses TensorFlow 2 and its built-in tf.keras API.
>>>> AFTER reading everything, look at STEP 14 before you get started -- the TFLite ROADMAP said LSTM support would come in 2019.
FIRST: download the UCF-101 dataset found here: https://www.crcv.ucf.edu/data/UCF101.php
(remember where you put it so you can modify the Python code to point to that data directory)
BASIC INSTRUCTIONS
1) Assumes you have already installed the latest TensorFlow
2) Assumes you have already installed Jupyter Notebook
3) GET the Jupyter Notebook file here that represents the code below, and look at the step-by-step explanation below
4) AND refer to the TensorFlow API for changes and details -- it is up to YOU to fix any errors or discrepancies in the code caused by changing APIs.
5) RUN the code as is and, using TensorBoard, show the results of training and validation
6) Figure out how to store the trained model, convert it to a mobile-ready model, and integrate it into an Android app -- extend the previous exercises to have a spinner option called "ActiveRec".
7) Make a YouTube video (one per group) showing the Jupyter notebook running at different stages (cut and paste the video -- I don't want to see it training for too long -- but show me the results after each step of the code to the end). Also show TensorBoard running AND show the mobile application running. Post to BB-Exercises-Ex:Learn Training
Jupyter Notebook
Corresponding Python
STEP 1: Install modules
STEP 2: Import modules
STEP 3: Set up variables, including the path to where you installed the UCF-101 dataset
NOTE: SEQUENCE_LENGTH specifies how many frames your LSTM will process for
each video
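Here is a minimal sketch of the kind of setup STEP 3 performs; aside from SEQUENCE_LENGTH, the variable names (BASE_PATH, VIDEOS_PATTERN) and values are assumptions for illustration only:
# STEP 3 sketch -- BASE_PATH and VIDEOS_PATTERN are assumed names; point BASE_PATH at
# wherever you unpacked UCF-101
import os

SEQUENCE_LENGTH = 40                                       # frames processed per video by the LSTM
BASE_PATH = '/path/to/UCF-101'                             # <-- change to your dataset directory
VIDEOS_PATTERN = os.path.join(BASE_PATH, '**', '*.avi')    # used later to glob all videos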
STEP 4: Use the Inception V3 CNN, an existing pre-trained model in TensorFlow (see its API), to perform
feature extraction for us (rather than developing our own).
STEP 5: Use the pre-built Inception V3 model as the CNN for our feature extraction.
Read about Inception here, see the API, and see the better Keras documentation on InceptionV3.
- Create an Inception v3 instance where:
- include_top=False MEANS no fully connected layer --> hence this does not have the final fully connected layer and is just the feature-extraction part of the network.
- weights='imagenet' MEANS load the InceptionV3 network pre-trained on the ImageNet dataset.
- Make a NEW model that is basically InceptionV3 without the final fully connected layer but ADD a GlobalAveragePooling2D layer (which will flatten the output to 1x2048) --> call it feature_extraction_model.
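One plausible way to build the feature extractor described above (a sketch, not necessarily the exact code in the notebook):
import tensorflow as tf

# InceptionV3 without its final fully connected classifier, pre-trained on ImageNet
inception = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')

# add GlobalAveragePooling2D so each frame comes out as a 1x2048 feature vector
pooled_output = tf.keras.layers.GlobalAveragePooling2D()(inception.output)
feature_extraction_model = tf.keras.Model(inception.input, pooled_output)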
STEP 6: Extract features using the InceptionV3 CNN model built in Steps 4 and 5...
.....the features will be stored in .npy files.
NOTE: Extraction takes about 1h20m on an NVIDIA 1080 GPU.
I DON'T have that good a GPU, so mine took MUCH LONGER -- and I was doing other work at the same time.
THIS STEP may take a while!!!
....at the end, the progress bar (tqdm) updates for each iteration/image through the dataset
(Total 33,085 iterations = #image frames from all videos = #samples per video * #videos).
- tqdm is a progress manager that prints the progress bar at the bottom of the block every time a new image is pulled from the dataset; see https://tqdm.github.io/docs/tqdm and also https://github.com/tqdm/tqdm
- Cycle through the images in the dataset AND
- call feature_extraction_model (Inception v3) on each image -> this will yield a 1x2048 feature vector.
- Reshape the feature vector output; see the API: https://www.tensorflow.org/api_docs/python/tf/reshape NOTE: a -1 in the shape tells tf.reshape to infer that dimension from the total number of elements, as in the examples below:
# tensor 't' is [1, 2, 3, 4, 5, 6, 7, 8, 9]
# tensor 't' has shape [9]
reshape(t, [3, 3]) ==> [[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]]

# tensor 't' is [[[1, 1], [2, 2]],
#                [[3, 3], [4, 4]]]
# tensor 't' has shape [2, 2, 2]
reshape(t, [2, 4]) ==> [[1, 1, 2, 2],
                        [3, 3, 4, 4]]

# tensor 't' is [[[1, 1, 1],
#                 [2, 2, 2]],
#                [[3, 3, 3],
#                 [4, 4, 4]],
#                [[5, 5, 5],
#                 [6, 6, 6]]]
# tensor 't' has shape [3, 2, 3]
# pass '[-1]' to flatten 't'
reshape(t, [-1]) ==> [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]

# for the same tensor 't' of shape [3, 2, 3]:
# -1 is inferred to be 9:
reshape(t, [2, -1]) ==> [[1, 1, 1, 2, 2, 2, 3, 3, 3],
                         [4, 4, 4, 5, 5, 5, 6, 6, 6]]
# -1 is inferred to be 2:
reshape(t, [-1, 9]) ==> [[1, 1, 1, 2, 2, 2, 3, 3, 3],
                         [4, 4, 4, 5, 5, 5, 6, 6, 6]]
- Convert the batch_features feature vector to a NumPy array (see the NumPy documentation).
- For all of the images from the same .avi, append to the feature-vector NumPy array and then write it out to a .npy file in the same directory as the .avi file.
- This is what your directories will look like -- they will contain BOTH the original .avi files and now, for each one, a .npy file that contains the features.
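As a rough sketch of this extraction loop (the notebook streams frames through a tf.data pipeline with tqdm; here sample_frames is a hypothetical helper that returns up to SEQUENCE_LENGTH decoded RGB frames, and VIDEOS_PATTERN comes from the STEP 3 sketch above):
import glob
import numpy as np
import tqdm

def extract_features_for_video(video_path):
    frames = sample_frames(video_path, SEQUENCE_LENGTH)       # hypothetical frame-sampling helper
    features = []
    for frame in frames:
        img = tf.image.resize(frame, (299, 299))
        img = tf.keras.applications.inception_v3.preprocess_input(img)
        vec = feature_extraction_model(tf.expand_dims(img, axis=0))   # shape (1, 2048)
        features.append(tf.reshape(vec, (1, -1)).numpy())
    return np.vstack(features)                                 # shape (num_sampled_frames, 2048)

# write a .npy feature file next to every .avi file
for video_path in tqdm.tqdm(glob.glob(VIDEOS_PATTERN, recursive=True)):
    np.save(video_path.replace('.avi', '.npy'), extract_features_for_video(video_path))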
STEP 7: Create labels for our classes.
- Load the labels for the video activities we have data for and will use to train our LSTM.
- Encode each label as 1 or 0 to tell whether a sample belongs to a class or not (one-hot encoding); uses LabelBinarizer, part of scikit-learn (read https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html).
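A short sketch of the label encoding, assuming LABELS holds the list of UCF-101 class names loaded in this step:
from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer()
label_binarizer.fit(LABELS)                                    # LABELS = list of class names

# e.g. one class name becomes a one-hot vector of length len(LABELS)
one_hot = label_binarizer.transform(['ApplyEyeMakeup'])[0]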
STEP 8: Define the network with an LSTM to train for Video Activity Recognition
- Layer 1 = Masking layer (see the Keras doc), which is used in RNNs; you should always insert a Masking(mask_value=0) layer before the LSTM layer. For each timestep in the input tensor (dimension #1 in the tensor), if all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers (as long as they support masking).
- Layer 2 = LSTM. IMPORTANT: what you are really defining is what ONE LSTM cell looks like -- you will have SEQUENCE_LENGTH cells (as defined above).
- 512 #units. IMPORTANT: the first parameter, #units, is not the number of cells you have, but rather the size of the cell's internal (hidden) state vector that is produced and fed into the next cell (next time step in the sequence). There is no "correct" value for #units. Generally a larger number will yield more parameters in your model to learn; it will require more training and more data BUT can also be more discriminating. Think of this like having a larger number of layers in a CNN. READ this blog for a better understanding.
- dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
- recurrent_dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
- Layer 3 = fully connected (Dense) layer -- see the Keras documentation. Number of output values = 256.
- Layer 4 = Dropout.
- Layer 5 = final fully connected softmax decision layer -- the output has one value per class/label. A sketch of the whole stack follows this list.
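The list above maps onto a tf.keras model along these lines (a sketch: the 0.5 dropout rates are assumed values, and LABELS/SEQUENCE_LENGTH come from the earlier steps):
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.,
                            input_shape=(SEQUENCE_LENGTH, 2048)),   # skip all-zero (padded) timesteps
    tf.keras.layers.LSTM(512, dropout=0.5, recurrent_dropout=0.5),  # 512 units in the hidden state
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(len(LABELS), activation='softmax')        # one output per class
])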
STEP 9: Specify the loss function & optimizer function & metrics you wish to calculate for training
- compile: Specify the loss as categorical_crossentropy, which is the cross-entropy loss between true labels and predicted labels.
- compile: Specify the optimizer as rmsprop, which is used to adjust the weights during training as described here.
- compile: Specify the metric to calculate as top_k_categorical_accuracy (see the better documentation here) -- with the default value of k=5 it calculates the top-k categorical accuracy rate, i.e. success when the target class is within the top-k predictions.
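STEP 9 then boils down to a single compile call (straight from the description above):
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['top_k_categorical_accuracy'])   # top-k accuracy, default k=5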
STEP 10: Specify the files to use in testing (test_file) and in training (train_file). Create the function make_generator, which takes in an array of files, randomly shuffles it, and builds out the list of .npy feature files (with labels) for each .avi in that set of files.
To give an idea of the balance between training and testing: for the "ApplyEyeMakeup" class there are 102 videos in the training dataset and 44 videos in the testing dataset.
Format for trainlist01.txt: FileName ClassLabelNumber
e.g. ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1
Format for testlist01.txt: FileName
e.g. ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi
This is what the train_list looks like (see the printout in the notebook); a hedged sketch of make_generator follows.
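A sketch of what make_generator might look like, assuming file_list holds relative paths like "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi" (label numbers already stripped) and reusing BASE_PATH, SEQUENCE_LENGTH, and label_binarizer from the earlier sketches; the exact padding and label handling in the notebook may differ:
import os
import numpy as np

def make_generator(file_list):
    def generator():
        np.random.shuffle(file_list)                           # random order each pass
        for video_path in file_list:
            class_name = video_path.split('/')[0]              # e.g. "ApplyEyeMakeup"
            feature_file = os.path.join(BASE_PATH, video_path).replace('.avi', '.npy')
            features = np.load(feature_file)
            # pad/trim to SEQUENCE_LENGTH timesteps so every sample has the same shape
            padded = np.zeros((SEQUENCE_LENGTH, 2048), dtype=np.float32)
            padded[:min(len(features), SEQUENCE_LENGTH)] = features[:SEQUENCE_LENGTH]
            label = label_binarizer.transform([class_name])[0].astype(np.float32)
            yield padded, label
    return generator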
STEP 11: Create the training dataset (train_dataset) and validation dataset (valid_dataset)
Create Datasets from the list of training feature files (train_list, Step 10) and validation feature files (test_list, Step 10).
- Create the datasets: train_dataset and valid_dataset. Note we are batching 16 & prefetching. Also, each training sample has SEQUENCE_LENGTH (40 -- see above where defined) feature vectors of length 2048 and a label associated with it.
- batch = the number of training samples to work through before the model's internal parameters are updated.
- tf.data.Dataset.prefetch = can be used to decouple the time when data is produced from the time when data is consumed. In particular, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested. The number of elements to prefetch should be equal to (or possibly greater than) the number of batches consumed by a single training step. Read about ways to improve training performance here.
- Rather than setting the amount to prefetch manually, use tf.data.experimental.AUTOTUNE, which will prompt the tf.data runtime to tune the value dynamically at runtime (how it does this is not documented).
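A sketch of the dataset construction, matching the shapes described above (16-sample batches of SEQUENCE_LENGTH x 2048 features plus a one-hot label; the dtypes are an assumption):
train_dataset = tf.data.Dataset.from_generator(
    make_generator(train_list),
    output_types=(tf.float32, tf.float32),
    output_shapes=((SEQUENCE_LENGTH, 2048), (len(LABELS),)))
train_dataset = train_dataset.batch(16).prefetch(tf.data.experimental.AUTOTUNE)

valid_dataset = tf.data.Dataset.from_generator(
    make_generator(test_list),
    output_types=(tf.float32, tf.float32),
    output_shapes=((SEQUENCE_LENGTH, 2048), (len(LABELS),)))
valid_dataset = valid_dataset.batch(16).prefetch(tf.data.experimental.AUTOTUNE)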
STEP 12: Train your LSTM
Create a callback pointing to tf.keras.callbacks.TensorBoard, which will store information (logs, etc.) that is used by
TensorBoard, a visualization tool in TensorFlow that shows the trends in training and validation -- it will plot the data, etc.
- update_freq: 'batch', 'epoch', or an integer. When using 'batch', writes the losses and metrics to TensorBoard after each batch.
The same applies for 'epoch'. If using an integer, say 1000, the callback will write the metrics and losses to TensorBoard
every 1000 samples. Note that writing too frequently to TensorBoard can slow down your training.
- Epoch = controls the number of complete passes through the training dataset.
Call model.fit:
- passing train_dataset & valid_dataset
- specifying that it will run for 17 epochs (a stopping condition)
- specifying the callback function to call at certain points (trigger points)
Ignore the warning -- but notice that with each epoch the accuracy improves:
@Epoch 1: accuracy = 0.0625
@Epoch 5: accuracy = 0.6662
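Putting STEP 12 together (the log directory name and update_freq value are assumptions):
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='logs', update_freq=1000)

model.fit(train_dataset,
          epochs=17,
          callbacks=[tensorboard_callback],
          validation_data=valid_dataset)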
STEP 13: SAVE your model -- using the same model.save as found in the CNN example.
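For reference, the save call can target either format (the file names below are placeholders; the SavedModel form is what TensorFlow Serving and the TFLite converter expect):
model.save('video_activity_model.h5')             # Keras HDF5 file
model.save('saved_model/video_activity_model')    # TensorFlow SavedModel directory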
STEP 14: CONVERT the LSTM model to TFLite -- you have to investigate this -- https://www.tensorflow.org/lite/convert/rnn (THERE are limitations).
SUPPOSEDLY THE FOLLOWING IS NO LONGER TRUE AND TFLite conversion of LSTMs IS possible -- until we know for sure, the info below is left in place.
The TensorFlow Lite interpreter currently implements a subset of TensorFlow operations, meaning some model architectures cannot immediately be converted due to missing operations.
Some RNN-based architectures are affected by this.
Option: Deploy your trained and saved TensorFlow model to the cloud (follow the official directions) and make request/response calls via REST (web-based request/response). I believe you will have to deploy using the SavedModel format (not h5), https://www.tensorflow.org/guide/saved_model, and you will "deploy" using TensorFlow Serving.
The TFLite ROADMAP said LSTM support would come in 2019.
Possible OPTION to get TFLite to work???? https://www.youtube.com/watch?v=gHN0jDbJz8E&feature=youtu.be -- only for limited ops or for flow control like RNNs?? OR LOOK at: for models with LSTMs/RNNs, you can also try the experimental OpHint API to convert.
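A hedged sketch of one conversion attempt, starting from the SavedModel saved in STEP 13 (whether this works depends on your TensorFlow version; the SELECT_TF_OPS fallback is one documented way to cover ops that TFLite's built-ins do not support):
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model/video_activity_model')
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,    # standard TFLite ops
    tf.lite.OpsSet.SELECT_TF_OPS]      # fall back to selected TensorFlow ops for the LSTM
tflite_model = converter.convert()
with open('video_activity_model.tflite', 'wb') as f:
    f.write(tflite_model)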
STEP 7: Run prediction on a SINGLE video clip (from a file) -- mimicking parts of what you need to do
in a run-time scenario:
7.1 = capture SEQUENCE_LENGTH images from the video clip
7.2 = process each image -> resize & convert to a tensor, call the preprocessor for the Feature Extractor model (this mimics what is done to the images in training)
7.3 = take each processed image tensor and pass it to the Feature Extractor CNN -> the output is a 1x2048 feature-vector tensor; store the feature vectors in an array
7.4 = convert your array of 40 feature_vector_tensors (each 1x2048) -> a tensor which will be the input to your LSTM model
7.5 = call prediction on the tensor representing your video using your LSTM model
STEP 7.1 capture SEQUENCE_LENGTH images from the video clip & 7.2 process each image & 7.3 call the Feature Extraction CNN
# Create a function to go through the specified video and sample SEQUENCE_LENGTH images from it.
# Each image is resized and preprocessed for input into the InceptionV3 FeatureExtractor CNN,
# then run through the FeatureExtractor -- the output is a 1x2048 feature vector per image,
# which is appended to the set of features that is returned.
# features is an array of SEQUENCE_LENGTH (40) tensors (each 1x2048 in length).
def grabImagesFromVideo_Process_with_InceptionCNN_FeatureExtractor(video_path, feature_extraction_model):
    # cv2, tf (TensorFlow), and SEQUENCE_LENGTH come from the earlier import/setup steps
    print(" going to process " + str(video_path))
    features = []
    cap = cv2.VideoCapture(video_path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    sample_every_frame = max(1, num_frames // SEQUENCE_LENGTH)
    max_images = SEQUENCE_LENGTH
    # cycle through the frames in the video
    for current_frame_index in range(num_frames):
        print(" on frame " + str(current_frame_index))
        # read in the next frame from the video
        success, frame = cap.read()
        if not success:
            break
        # take every kth (sample_every_frame) frame
        if current_frame_index % sample_every_frame == 0:
            # OpenCV reads in BGR, TensorFlow expects RGB, so we invert the channel order
            frame = frame[:, :, ::-1]
            # resize and preprocess the image for feature extraction with the InceptionV3 CNN;
            # this converts the array to a 299x299 tensor
            img = tf.image.resize(frame, (299, 299))
            img = tf.keras.applications.inception_v3.preprocess_input(img)
            print(" img shape is " + str(img.shape))
            # add a batch dimension: (299, 299, 3) -> (1, 299, 299, 3)
            tensor_input = tf.expand_dims(img, axis=0)
            print(" expanded dimension tensor shape is " + str(tensor_input.shape))
            # diagnostics: in TF 2.x eager mode you can inspect the tensors directly, e.g.
            # print(img)
            # print(tensor_input)
            # now process with the feature extraction model (InceptionV3-based CNN)
            current_features = feature_extraction_model(tensor_input)
            # reshape the tensor to shape (1, #features) -- see
            # https://www.tensorflow.org/api_docs/python/tf/reshape
            current_features = tf.reshape(current_features, (current_features.shape[0], -1))
            # convert the current_features tensor to a NumPy array (works in TF 2.x eager mode)
            current_features = current_features.numpy()
            features.append(current_features)
            # if we have sampled SEQUENCE_LENGTH frames then stop
            max_images -= 1
            if max_images == 0:
                break
    cap.release()
    # if needed, pad to SEQUENCE_LENGTH with zeros, e.g.:
    # padded_sequence = np.zeros((SEQUENCE_LENGTH, 2048))
    # padded_sequence[0:len(features)] = np.array(features)
    return features
# steps 7.1 & 7.2: call the function, passing a video_path, to grab SEQUENCE_LENGTH images
# step 7.3: the same call also takes the feature extraction model, runs each image through it,
# and returns the array of SEQUENCE_LENGTH feature vectors
# (here, video is the path of the single .avi clip you want to classify)
features = grabImagesFromVideo_Process_with_InceptionCNN_FeatureExtractor(video, feature_extraction_model)
print(" Returned set of features is " + str(features))
VARIOUS printouts showing steps 7.1, 7.2 & 7.3
7.4 convert the above array of SEQUENCE_LENGTH tensors to a tensor
7.5 run the tensor input through our LSTM model for prediction
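A sketch of steps 7.4 and 7.5, assuming features holds the SEQUENCE_LENGTH (1x2048) vectors returned above (pad with zeros first if the clip yielded fewer, as hinted in the commented-out lines), LABELS is the class-name list from STEP 7, and model is the LSTM trained in STEP 12 (or loaded back from STEP 13):
import numpy as np

# 7.4: stack the feature vectors into one (1, SEQUENCE_LENGTH, 2048) tensor
sequence = tf.convert_to_tensor(np.vstack(features), dtype=tf.float32)   # (num_frames, 2048)
sequence = tf.expand_dims(sequence, axis=0)                              # add the batch dimension

# 7.5: run it through the trained LSTM model and report the most likely class
predictions = model.predict(sequence)
predicted_class = LABELS[int(np.argmax(predictions[0]))]
print(predicted_class)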