Padding Sequences with Simple Min/Max/Mean Document Encodings

Snapshots from Weakly Supervised Learning (O’Reilly, 2020)

Russell Jurney

In this post, we’ll cover how to pad text sequences with three values that perform surprisingly well: the coordinate-wise min, max and mean of a document’s word vectors. These are a simple form of document encoding that characterizes the approximate meaning of the entire document. They tend to perform better than imputing a single static value across all documents.

This is part 1 of a series of posts where I will demonstrate techniques from weakly supervised learning using Stackoverflow questions from the Stack Exchange Data Dump as part of writing my forthcoming book Weakly Supervised Learning (O’Reilly, 2020). These techniques include transfer learning, semi-supervised learning, distant supervision and weak supervision. In these posts I’ll be demonstrating technical recipes, while the book will be a cohesive overview of the field by solving real-world problems.

Most of you can skip ahead, but if you’re new to deep learning and natural language processing, you need an understanding of text embeddings and sequence padding to move forward. In short, in this post, we’re encoding words in documents according to the meaning of the words (this is what Word2Vec does) and making them all the same length so we can use linear algebra to process them. This post is about what value(s) to use to make short documents the same length as longer documents.

Check out the post The Amazing Power of Word Vectors by Adrian Colyer, and then come back and start here.

The performance of different methods of imputation when padding encoded documents is described in the paper Representation learning for very short texts using weighted word embedding aggregation which was [very helpfully] referenced from this Stack Overflow answer. For more on imputation in general, check out this post by Will Badr.

User D.W. writes:

One simple technique that seems to work reasonably well for short texts (e.g., a sentence or a tweet) is to compute the vector for each word in the document, and then aggregate them using the coordinate-wise mean, min, or max.
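As a quick sketch of this coordinate-wise aggregation, here is what it looks like in numpy (the vectors below are toy data standing in for real word embeddings):

```python
import numpy as np

# Toy word vectors for a 4-word document, embedding dimension 3
word_vectors = np.array([
    [0.1, -0.2, 0.5],
    [0.4,  0.0, -0.3],
    [-0.2, 0.3, 0.1],
    [0.0,  0.1, 0.2],
])

# Coordinate-wise aggregations: each collapses the document
# into a single fixed-length vector, regardless of word count
doc_mean = word_vectors.mean(axis=0)
doc_min = word_vectors.min(axis=0)
doc_max = word_vectors.max(axis=0)

# The min/max encoding concatenates min and max into a 2*d vector
doc_minmax = np.concatenate([doc_min, doc_max])
```

Whatever the document length, each aggregation yields a vector of the embedding dimension (or twice it, for min/max), which is what makes these useful both as document encodings and as pad values.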

The paper uses the mean, max and concatenated min/max as baselines, measuring their performance on a semantic similarity task against the authors’ own algorithm.

Source: Representation learning for very short texts using weighted word embedding aggregation, De Boom, Van Canneyt et al., 2016

While not state of the art at the time, min/max/mean are easy to compute and can be an excellent baseline — which is why they were used for the book (although I ended up going with a simple Conv1D model as a baseline, which uses its own embedding).

Note that padding with alternating min/max is something that keras.preprocessing.sequence.pad_sequences can’t do, as it accepts only a single float/string as an argument for the pad value. Keras has many limitations like this one because its goal is simplicity and accessibility, not state-of-the-art performance. I’ve found that as I’ve gotten deeper into deep learning and natural language processing, the limitations of Keras have driven me towards Keras internals, raw TensorFlow and PyTorch when I’m refining models.

We use Gensim’s models.Word2Vec to encode our tokenized text. Nothing new here. First we check if the model is loaded and if not, we recreate and save it. Then as a sanity check, we test the embedding model out by predicting lookalikes in terms of semantic similarity for an example word. We apply the model to the documents one by one, token by token, creating a new list of numpy.arrays, one for each document. Note: if you’re new to Word2Vec, check out this post by Suvro Banerjee.

Encoding tokenized documents into a dense Word2Vec vector representation

We will now compute a position-wise minimum, maximum and mean, concatenate the min and max, and use either min/max or mean to pad any documents with fewer than MAX_LENGTH words. We will simultaneously truncate any documents with more than MAX_LENGTH words. If the row has an odd number of values and the min/max padding extends the document past MAX_LENGTH, we chop off the extra value to make it even.

Manually padding with custom values: min/max or mean
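A minimal sketch of this padding/truncation step, assuming each document is a 2-D numpy array of word vectors; MAX_LENGTH and the alternating min/max scheme follow the description above, but the value here is illustrative:

```python
import numpy as np

MAX_LENGTH = 6  # illustrative; use the length budget of your model


def pad_doc_minmax(doc: np.ndarray, max_length: int = MAX_LENGTH) -> np.ndarray:
    """Truncate a document to max_length rows, or pad it with
    alternating coordinate-wise min/max vectors."""
    # Truncate documents that are already long enough
    if len(doc) >= max_length:
        return doc[:max_length]
    # Position-wise min and max of this document's word vectors
    doc_min = doc.min(axis=0)
    doc_max = doc.max(axis=0)
    padded = list(doc)
    # Append min/max pairs until we reach or pass max_length...
    while len(padded) < max_length:
        padded.extend([doc_min, doc_max])
    # ...then chop off any extra value from an odd-length overshoot
    return np.array(padded[:max_length])
```

Padding with the mean instead is the same loop with `doc.mean(axis=0)` appended as a single repeated pad vector.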

Tada! Not a bad recipe for cooking up something simple 🙂 I used this with scikit-learn’s RandomForestClassifier to create a baseline model, but found that I needed something non-linear as a baseline.

In the next post, we’ll look at how I used CuPy and CUDA to speed up reshaping this encoded/padded data from a 3D vector to a long 2D vector for consumption by a random forest model.

Russell Jurney is a machine learning and visualization consultant at Data Syndrome where he specializes in weakly supervised learning (doing more with less data), end-to-end machine learning product development, data labeling, agile data science coaching and predictive lead generation.

