Deep Multi-Input Models Transfer Learning for Image and Word Tag Recognition

A multi-models deep learning approach for image and text understanding

Yuefeng Zhang, PhD

With the advancement of deep learning techniques such as the convolutional neural network (ConvNet) [1], computer vision has become a hot research topic again. One of the main goals of computer vision today is to use machine learning (especially deep learning) to train computers to gain human-level understanding of digital images, text, and videos.

With its widespread use, the ConvNet has become the de facto model for image recognition. As described in [1], there are generally two approaches to using a ConvNet for computer vision:

  • Training a new ConvNet model from scratch
  • Using transfer learning, that is, using a pre-trained ConvNet model

As shown in the following diagram, a ConvNet model consists of two parts: a convolutional base and a fully connected classifier.

Figure 1: Typical scenario of ConvNet transfer learning.

ConvNet transfer learning can be further subdivided into three methods:

  • Method 1: Feature extraction without image augmentation [1]. This method first uses a pre-trained convolutional base to convert new images into arrays such as Numpy arrays (which can be saved to files if necessary) and then uses those in-memory array representations of the images to train a separate new classification model with randomly initialized weights.
  • Method 2: Feature extraction with image augmentation [1]. This method builds a new model with the pre-trained convolutional base as the input layer, freezes the weights of the convolutional base, and finally adds a new output classifier with randomly initialized weights.
  • Method 3: Fine-tuning [1]. This method does not keep the whole pre-trained convolutional base frozen. Instead, it unfreezes some of the top layers of the frozen convolutional base so that those layers can be jointly trained with a new fully connected classifier.

Method 2 is used in this article for multi-input models transfer learning.

The main idea behind transfer learning applies not only to supervised ConvNets but also to other deep learning algorithms, such as the unsupervised word embedding models used in natural language processing (NLP) [4].

There are two popular pre-trained word embedding models: word2vec and GloVe [3]. Like the word2vec-keras model used in [4], these pre-trained word embedding models are usually combined with supervised deep learning algorithms, such as the recurrent neural network (RNN) with LSTM, for NLP tasks such as text classification [4].

A ConvNet model or an NLP model (e.g., a combination of word embedding and LSTM) can be used on its own to solve many interesting problems in computer vision and NLP. As shown in this article, these different types of models can also be combined in various ways [1] to form more powerful models that address more challenging problems, such as insurance claim process automation, which requires not only image recognition but also natural language (e.g., text) understanding.

This article uses an interesting but challenging Kaggle dataset, Challenges in Representation Learning: Multi-modal Learning [2], to present a new multi-input transfer learning model that combines two input models with a fully connected classification layer for image recognition and word tag recognition at the same time.

The main idea behind the new multi-input model is to translate the problem of image and word tag recognition into a machine learning classification problem, that is, determining whether or not a given image matches a given set of word tags (0-No, 1-Yes).

1. Data Preparation

After the Kaggle dataset of image files and word tag files [2] has been downloaded onto a local machine, the code below can be used to build and shuffle the lists of image file names and the related word tag file names. There are 100,000 image files and 100,000 corresponding word tag files in the dataset for training purposes.
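A minimal sketch of this step (the directory names image_dir and tags_dir, and the pairing of same-named files, are assumptions; the actual paths in [5] may differ):

import os
import random

# Assumed local directories for the downloaded Kaggle dataset [2].
image_dir = './data/images'
tags_dir = './data/word_tags'

# Each image file is assumed to have a same-named word tag file.
image_files = sorted(os.listdir(image_dir))
tag_files = sorted(os.listdir(tags_dir))

# Shuffle both lists in the same order so each image stays paired with its tags.
pairs = list(zip(image_files, tag_files))
random.seed(42)
random.shuffle(pairs)
image_files, tag_files = map(list, zip(*pairs))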


In order to train the new multi-input model on a laptop within a reasonable amount of time (a few hours), I randomly selected 2,000 images and corresponding 2,000 word tag files for model training for this article:
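A simple way to do this, given the shuffled lists above:

NUM_SAMPLES = 2000

# The lists are already shuffled, so taking the first 2,000 entries
# amounts to a random sample.
sample_image_files = image_files[:NUM_SAMPLES]
sample_tag_files = tag_files[:NUM_SAMPLES]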


The code below loads the 2,000 image file names and the corresponding 2,000 sets of word tags into a Pandas DataFrame:
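A sketch, assuming each word tag file holds one tag per line:

import pandas as pd

def read_tags(file_name):
    # Read one word tag file; tags are assumed to be one per line.
    with open(os.path.join(tags_dir, file_name)) as f:
        return ' '.join(line.strip() for line in f if line.strip())

df = pd.DataFrame({
    'image_file': sample_image_files,
    'word_tags': [read_tags(f) for f in sample_tag_files],
})
df.head()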


As in [4], a textual data preprocessing procedure is included in the Jupyter notebook [5] to perform minimal preprocessing, such as removing stop words and numbers, in case it makes a significant difference:
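A sketch of such a preprocessing step, assuming NLTK's English stop word list is available (via nltk.download('stopwords')):

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess(tags):
    # Drop stop words and purely numeric tokens from a tag string.
    tokens = re.split(r'\s+', tags.lower().strip())
    return ' '.join(t for t in tokens
                    if t and t not in stop_words and not t.isdigit())

df['clean_tags'] = df['word_tags'].apply(preprocess)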

As described in [4], the impact of such textual data preprocessing is insignificant, and thus the raw word tags without preprocessing are used for model training in this article.

2. Architecture of Multi-Input Models Transfer Learning

As shown in the diagram below, the new multi-input transfer learning model uses the pre-trained ConvNet model VGG16 to receive and handle images, and a new NLP model (a combination of the pre-trained word embedding model GloVe and a Keras LSTM) to receive and handle word tags. These two input models are first merged together and then combined with a fully connected output classifier. The output classifier uses both the image recognition model output and the NLP model output to determine whether an input pair of an image and a set of word tags is a match (0-No, 1-Yes).

Figure 2: The architecture of the new deep learning model for multi-input models transfer learning.

3. Transfer Learning for Image Recognition

As shown in Figure 2, the new multi-input transfer learning model uses the pre-trained ConvNet model VGG16 for image recognition. The VGG16 model is included in the Keras library. The following code, adapted from [1], combines the VGG16 convolutional base with a new fully connected classifier to form the image recognition input model:
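A sketch of this step (the 150x150 input size and the 256-unit dense layer are assumptions; the Keras functional API is used so the branch can be merged later):

from keras.applications import VGG16
from keras import layers, Input

# Pre-trained VGG16 convolutional base with ImageNet weights and no top classifier.
conv_base = VGG16(weights='imagenet', include_top=False,
                  input_shape=(150, 150, 3))

# Image input branch: the VGG16 base followed by a new dense layer.
image_input = Input(shape=(150, 150, 3), name='image_input')
x = conv_base(image_input)
x = layers.Flatten()(x)
image_branch = layers.Dense(256, activation='relu')(x)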


4. Transfer Learning for Text Classification

As shown in Figure 2, the new multi-input transfer learning model uses the pre-trained word embedding model GloVe [3] to convert word tags into compact vectors. Once the GloVe dataset [3] has been downloaded to the local machine, the following code from [1] can be used to load the word embedding model into memory:
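A sketch of this step, assuming the 100-dimensional glove.6B.100d.txt file is used:

import numpy as np

glove_dir = './glove.6B'  # assumed location of the downloaded GloVe files

# Parse the GloVe vectors into a word -> vector dictionary.
embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

print('Found %s word vectors.' % len(embeddings_index))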


As shown in Figure 2, the GloVe word embedding is combined with a Keras LSTM to form a new NLP input model for predicting/recognizing word tags:
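A sketch of this branch (max_words, maxlen, embedding_dim, and the 32-unit LSTM are assumptions; word_index is the word-to-integer mapping produced by the tokenizer in Section 6):

from keras import layers, Input

max_words = 10000    # vocabulary size (assumption)
maxlen = 20          # maximum number of word tags per image (assumption)
embedding_dim = 100  # must match the GloVe file loaded above

# Build the embedding matrix from GloVe, indexed by the tokenizer's word_index.
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

# Text input branch: GloVe embedding followed by an LSTM.
text_input = Input(shape=(maxlen,), name='text_input')
embedded = layers.Embedding(max_words, embedding_dim,
                            input_length=maxlen,
                            name='glove_embedding')(text_input)
text_branch = layers.LSTM(32)(embedded)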


5. Combining Multi-Input Models with Fully Connected Classifier

Once the new image recognition input model and the new NLP input model have been created, the following code combines them with a new output classifier into a single multi-input transfer learning model:
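A sketch of the merge and the output classifier (the 64-unit dense layer is an assumption):

from keras.models import Model
from keras import layers

# Merge the two input branches, then add the fully connected output classifier.
merged = layers.concatenate([image_branch, text_branch])
merged = layers.Dense(64, activation='relu')(merged)
output = layers.Dense(1, activation='sigmoid')(merged)  # 1 = match, 0 = no match

model = Model(inputs=[image_input, text_input], outputs=output)
model.summary()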


As described in [1], both the pre-trained VGG16 convolutional base and the GloVe word embedding layer must be frozen so that their pre-trained weights are not modified during the training of the new multi-input model:
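A sketch of the freezing step, reusing the conv_base and embedding_matrix defined above:

# Freeze the pre-trained VGG16 convolutional base.
conv_base.trainable = False

# Load the GloVe weights into the embedding layer and freeze it.
# (The model is compiled after freezing, in Section 6, so these flags take effect.)
embedding_layer = model.get_layer('glove_embedding')
embedding_layer.set_weights([embedding_matrix])
embedding_layer.trainable = False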


6. Multi-Input Model Training

The original Kaggle training dataset includes only the correct pairs of images and corresponding word tags; each such correct pair is labeled as 1 (match) in this article (see the code below). In order to create a balanced dataset, the following code creates 2,000 incorrect pairs of images and word tags in addition to the existing 2,000 correct pairs. For simplicity, this is achieved by pairing each of the selected 2,000 images (say, Image i) with the word tags of the next image file (i.e., the word tags of Image i+1).
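A sketch of this pairing logic, using the DataFrame built in Section 1:

# Positive pairs: each image with its own word tags, labeled 1 (match).
images = list(df['image_file'])
tags = list(df['word_tags'])
labels = [1] * len(images)

# Negative pairs: pair Image i with the word tags of Image i+1, labeled 0
# (the last image wraps around to the first image's tags).
n = len(df)
for i in range(n):
    images.append(df['image_file'][i])
    tags.append(df['word_tags'][(i + 1) % n])
    labels.append(0)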


There are 4,000 pairs of images and word tags in total: 2,000 correct pairs and 2,000 incorrect pairs.

Each word tag needs to be encoded as an integer, and each list/sequence of word tags needs to be converted into a sequence of integers, before the word tags can be consumed by the word embedding model. This is achieved as follows, using and modifying the code in [1]:
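A sketch using the Keras Tokenizer and pad_sequences utilities:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Encode each word tag as an integer and pad each tag sequence to maxlen.
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(tags)
sequences = tokenizer.texts_to_sequences(tags)
word_index = tokenizer.word_index
tag_data = pad_sequences(sequences, maxlen=maxlen)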


The resulting image and word tag training datasets are converted into Numpy arrays and shuffled for model training:
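A sketch, with image loading and scaling folded in (the 150x150 target size matches the image input model above):

from keras.preprocessing import image

def load_image(directory, file_name):
    # Load an image, resize it, and scale pixel values to [0, 1].
    img = image.load_img(os.path.join(directory, file_name),
                         target_size=(150, 150))
    return image.img_to_array(img) / 255.0

x_images = np.array([load_image(image_dir, f) for f in images])
x_tags = np.array(tag_data)
y = np.array(labels)

# Shuffle all three arrays with the same random permutation.
perm = np.random.permutation(len(y))
x_images, x_tags, y = x_images[perm], x_tags[perm], y[perm]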


The new multi-input model is compiled and trained as follows, with only 30 epochs and the 4,000 balanced pairs of images and word tags:
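A sketch of the compile/fit step (the RMSprop learning rate, batch size, and validation split are assumptions):

from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit([x_images, x_tags], y,
                    epochs=30,
                    batch_size=64,
                    validation_split=0.2)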


7. Model Prediction

As shown below, the testing dataset in [2] includes 500 images, and each image is associated with two sets of word tags:

Given an image in the testing dataset, the new multi-input transfer learning model needs to be able to predict which of the given two sets of word tags matches the image.

The following code loads the testing images into memory:
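A sketch, reusing the load_image helper above (test_image_dir is an assumption):

test_image_dir = './data/test_images'  # assumed location of the 500 test images

test_image_files = sorted(os.listdir(test_image_dir))
x_test_images = np.array([load_image(test_image_dir, f)
                          for f in test_image_files])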


The testing word tags are converted into sequences of encoded integer values as follows:
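A sketch, assuming test_tags_1 and test_tags_2 are lists holding the first and second candidate tag strings for each test image (loading them from the test tag files is omitted here):

x_test_tags_1 = pad_sequences(tokenizer.texts_to_sequences(test_tags_1),
                              maxlen=maxlen)
x_test_tags_2 = pad_sequences(tokenizer.texts_to_sequences(test_tags_2),
                              maxlen=maxlen)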


The resulting Python arrays of images and word tags are then converted into Numpy arrays and fed into the trained model for prediction:
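A sketch of the prediction step: each image is scored against both candidate tag sets, and the set with the higher match probability wins:

# Score each test image against both candidate tag sets.
p1 = model.predict([x_test_images, x_test_tags_1])
p2 = model.predict([x_test_images, x_test_tags_2])

# For each image, pick the tag set (1 or 2) with the higher match probability.
predicted_set = np.where(p1 >= p2, 1, 2)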


The following table shows the first 10 prediction results:

The following image is Image 363.png in the testing dataset:

The two associated sets of word tags are as follows:


The model predicts:


As another example, the following is Image 406.png in the testing dataset:

The two associated sets of word tags are as follows:


The model predicts:


The results above show that even though the new multi-input transfer learning model was trained with only 4,000 pairs of images and word tags and only 30 epochs, it performed reasonably well in terms of accuracy. The model performance can be further improved by training with more epochs and/or more pairs of images and word tags.

Summary

This article presented a new multi-input deep transfer learning model that combines two pre-trained input models (VGG16, and GloVe combined with LSTM) with a new fully connected classification layer for recognizing images and word tags simultaneously.

The key point of the new multi-input deep learning method is to translate the problem of image and word tag recognition into a classification problem, that is, determining whether or not a given image matches a given set of word tags (0-No, 1-Yes).

The challenging public dataset in Kaggle, Challenges in Representation Learning: Multi-modal Learning [2], was used to train and evaluate the new model.

The model prediction results demonstrated that the new model performed reasonably well with limited model training (only 30 epochs and 4,000 pairs of images and word tags) for demonstration purposes.

The model performance can be further improved by training the model with more epochs and/or more pairs of images and word tags.

A Jupyter notebook with all of the source code is available on GitHub [5].

References

[1] F. Chollet, Deep Learning with Python, Manning Publications Co., 2018

[2] Challenges in Representation Learning: Multi-modal Learning, Kaggle

[3] J. Pennington, R. Socher, and C. D. Manning, GloVe: Global Vectors for Word Representation, EMNLP 2014

[4] Y. Zhang, Deep Learning for Natural Language Processing Using word2vec-keras

[5] Y. Zhang, Jupyter notebook on GitHub
