Data Science in International Development. Part I: Working with Text

Co-written by Kelsey Barton-Henry.

Today, headlines are filled with claims about the power of Artificial Intelligence (AI) to do things only humans could do before: recognizing objects in images, responding to voice queries, or interpreting complex text, to mention a few. But how do AI applications work? What are the AI solutions being used in International Development and Security? In this post, we summarize some of the basic Machine Learning techniques that let computers process and react to human language, drawing on real-world scenarios from several of our projects at AKTEK.

Image: AKTEK Ltd

Development actors often deal with large amounts of text and speech data. This data carries information that is key to understanding conflict and social dynamics. Social media, news, and other text storage and sharing mechanisms have produced even more data in recent years. It is therefore more important than ever to be able to process text quickly and accurately. Artificial intelligence can help.

The area of artificial intelligence that works on processing text is called Natural Language Processing (NLP). NLP techniques are the basis of many well-known applications including spam filters, internet search engines, conversational bots, recommendation systems, customer service applications, and machine translation. One of the core techniques powering many of these tools is called Text Classification.

Text Classification is a part of Machine Learning in which the computer learns to automatically assign a given text to a specific category with high accuracy, something it can then do repeatedly without further human intervention.

At AKTEK, we have leveraged NLP and text classification to create technological solutions that work in the field: detecting potentially harmful extremist content online, or helping to expose disinformation spread online.

Detecting Online Extremism

Our flagship NLP project dealt with detecting online extremism in a country in Europe. We built, trained, and deployed a detector that could accurately identify extremist text online.

To build it, we worked with research experts in the local context and languages. Step one consisted of the researchers reviewing and labeling different samples of text that we extracted from selected online public platforms. They studied the content and labeled each piece of text as extremist or not extremist, based on the local political context.

That supervision provided us with enough labeled data to power a machine learning engine. We took this data and devised an ensemble of algorithms to find patterns in the data and learn to replicate and extrapolate the decision-making process of the researchers. We therefore had a detector that could ingest text from several online public platforms and provide a score linked to the probability that the text is aligned with different types of extremism and hate speech.

This AI engine allowed us to automatically process millions of comments from the online sphere, something that would have been impossible to do manually. Together with our expert researchers, we built a data-driven picture of extremism in the online space, to support policy-making and security.

Fighting Disinformation

Following an analogous process, we have developed a prototype to detect and expose disinformation online. This prototype is designed and trained on labeled fact-checked news articles. That way, it is able to find common linguistic patterns in the real and fake texts that are indicative of real news or disinformation.

In this case, we found that text alone was not enough to detect disinformation: we also provided our algorithm with data about where the article was published, what the article’s source was, who the authors were, and how the article had been shared on social media (and by whom).

The inclusion of this information allows us to analyse common patterns of orchestrated campaigns in social media and the presence of bots. With that combination of data, our prototype is already able to reach accuracies similar to those of some state-of-the-art AI studies published in the literature for specific scenarios.

The aim of this project is to provide a detector that can flag disinformation online in near real-time. We aim to help journalists, publishers, policy makers, and especially media consumers to better understand the nature of disinformation and to protect themselves against disinformation campaigns.

For now, this prototype has only been tested on a small sample of labelled data. It is too soon to estimate how the prototype’s performance will extrapolate to the real world, but we are taking steps in that direction, looking for ways to compile more fact-checked articles and to include an automated fact-checking process across sources.

Both of these projects are based on text classification, natural language processing, and other machine learning techniques. But how do they work?

How it works: Natural Language Processing and Text Classification

How do computers process text?

The machine learning techniques that are feeding most of the powerful AI applications nowadays are, well… mathematical algorithms. This means, at their core, these technologies are dealing with numbers. Image (or video) recognition processes pixel color brightness as the numerical inputs; sound signals are digitized and transformed to amplitudes and frequencies for speech-to-text applications. How we transform text into numerics is one of the key steps in any of the projects we talk about in this entry. This process is called text vectorization.

Text Vectorization: from Bag of Words…

The most basic, though often powerful, vectorization techniques are based on counting word frequencies in text. Those numbers can then be used to fill a matrix, with one row per piece of text and one column per word, which gives us the numerics we need to proceed further. After all, it is intuitive to think that when doing text classification the appearance and repetition of certain words have a strong correlation to the category a text belongs to, right?

Of course, this is a huge oversimplification. Indeed, counting words this way might result in over-weighting some words that are mentioned many times, but that carry no information for the text classification at hand. This is why there are several ways of ‘counting’ words in a piece of text. In some cases, ‘stopwords’ (i.e. very common words, such as ‘the’, ‘a’, or ‘is’) are removed to improve the performance of the algorithm.

These types of words do not contribute significantly to text meaning and therefore do not help to determine its category, so it is better to remove them. In other cases, these common words are down-weighted to reduce their importance. For an example of a more sophisticated vectorization method, see Term Frequency Inverse Document Frequency (tf-idf).
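As a rough illustration of the counting ideas above, here is a minimal sketch in plain Python. The three example sentences, the stopword list, and the function names are all invented for this illustration, and the tf-idf weighting shown is the simplest textbook variant (count times log(N / document frequency)); production libraries typically use smoothed versions.

```python
import math
from collections import Counter

# Toy corpus and stopword list, both invented for illustration.
DOCS = [
    "the march was peaceful and the crowd was calm",
    "the crowd attacked the police",
    "police praised the peaceful march",
]
STOPWORDS = frozenset({"the", "a", "is", "was", "and"})


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]


def bag_of_words(docs, stopwords=frozenset()):
    """Return (vocabulary, count matrix) with one row per document."""
    tokens = [tokenize(d, stopwords) for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    counts = [Counter(doc) for doc in tokens]
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


def tf_idf(matrix):
    """Weight raw counts by log(N / document frequency) per column."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: in how many documents does each word appear?
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


vocab, counts = bag_of_words(DOCS, STOPWORDS)   # stopwords removed
vocab_all, counts_all = bag_of_words(DOCS)      # stopwords kept
weighted_all = tf_idf(counts_all)               # stopwords down-weighted
```

In this toy corpus the word "the" appears in every document, so its tf-idf weight is zero even when it is not removed as a stopword, which is the down-weighting effect described above.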

What if a word has been misspelled? How do we account for multiple inflections of the same word? There are several methods to deal with these types of problems, such as lemmatization and stemming. Another way to deal with this is by counting sets of characters (that is, letters), and not just words.

You might notice that breaking down sentences using characters and words as text units will make us lose part of the contextual information. To partially overcome this we count not just single words and characters, but sequences of 2, 3, or more consecutive ones. These sequences are called n-grams.
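A short sketch of how n-grams can be extracted, using only plain Python. The function names and example strings are made up for illustration, and real pipelines usually add proper tokenization and punctuation handling.

```python
def word_ngrams(text, n):
    """Return all sequences of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return all sequences of n consecutive characters in the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Word bigrams keep some context: "not peaceful" stays together.
bigrams = word_ngrams("the march was not peaceful", 2)

# Character trigrams are tolerant to typos: a misspelled word still
# shares most of its trigrams with the correct spelling.
shared = set(char_ngrams("extremist", 3)) & set(char_ngrams("extremst", 3))
```

Here the misspelled "extremst" still shares four of its trigrams with "extremist", so a model counting character n-grams sees the two strings as closely related.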

…to word embeddings

Yet, the challenge for complex classification problems is that even using n-grams, we may still lose part of the contextual meaning. Computationally, these methods can also be challenging, as they lead to huge matrices that must be processed by computers, and are difficult to maintain in memory.

Worse, those huge sparse matrices (matrices with mostly 0s) make it hard for algorithms to learn and avoid overfitting. This problem is known as the curse of dimensionality. Therefore, we must carefully implement a very robust statistical framework to control the effects of these issues. For these more complex tasks, we can move to word embedding methods in order to improve performance.

Word embeddings were developed to overcome some of the limitations described above, and to capture more of the semantic information and the relationships between words in language. In order to construct word embeddings, we first mathematically process massive amounts of unlabeled text data. We then let an algorithm (usually a shallow neural network) learn to predict if a word belongs to a given context (or vice versa).

At the end of this process, the computer will still not understand the meaning of a word as humans do. But through this statistical process of repeatedly looking at text, it will “learn” to vectorize words in a meaningful way. Semantic and syntactic relations will be present in this new mathematical space. The well-known word analogies that can be recovered from it are worth looking up; some are really amazing.
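To give a feel for how such analogies are computed, here is a toy sketch. The vectors below are written by hand purely for illustration (they are not learned from any text, and real embeddings have hundreds of dimensions); only the arithmetic, vector addition plus cosine similarity, reflects how analogies are typically queried.

```python
import math

# Hand-made toy vectors, NOT learned from data. The four dimensions
# loosely stand for [royalty, male, female, fruit].
VECTORS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: 1 for same direction, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, as is commonly done.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


answer = analogy("man", "king", "woman", VECTORS)
```

With these toy vectors, "man is to king as woman is to ?" resolves to "queen". In learned embeddings the match is approximate rather than exact, which is why the nearest neighbour by cosine similarity is used.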

Image: AKTEK Ltd

Additionally, word embeddings are constructed so that the numerical matrices that we have to process are no longer as massive and sparse as before. While in bag of words models a full piece of text is transformed into a single sparse vector (with hundreds of thousands of entries or more, most of them zero), here each word is transformed into a dense vector (with just a few hundred entries).

Thus, a piece of text (multiple words) can be arranged, for instance, as a matrix: a two-dimensional numerical structure with one row per word. This structure preserves the order of the words, which opens up the possibility of applying deep learning techniques that exploit the sequential and contextual structure of text to build things like translation systems, chatbots, or extremism detectors.

Of course, this is not the end of the story: paragraph embeddings, character embeddings, ULMFiT, ELMo… and impressive new techniques are being researched as you read this article.

How do we train computers to classify text?

Once vectorization is completed, the text is in a form that the computer can process further. But even after transforming text into numerics, we still need to teach the computer to perform the specific text classification in which we are interested.

Supervised Learning

For this step, we need a carefully chosen sample of text pieces (posts, documents, or articles, depending on the task) that is labelled by analysts. The analyst team manually assigns these text pieces to their corresponding category, and we consider this assignment to be the ground truth label for each. Together with the vectorized text, the ground truth labels form the training set, and with that we are ready to train a classifier.

Multiple algorithms exist to perform this classification, all with their advantages and disadvantages. At a very high level, algorithms are mathematical recipes that operate on the text-vectors and output a probability that the text belongs to a specific class. To be able to classify new text as accurately as possible, algorithms work to minimize the error between their predictions on the training set and the true classifications of those same texts.

Image: AKTEK Ltd

The process of that minimization is another rich mathematical field by itself, and it is usually conducted through differentiation, algebra, and other powerful numerical methods. The process of ingesting labeled data and minimizing the prediction error is called training. This process allows the algorithm to learn the patterns in the data that determine if an article belongs to a specific category (such as whether a post is extremist or not).
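As one concrete and deliberately simple instance of training, the sketch below fits a logistic regression classifier by gradient descent in plain Python. This is an illustrative choice, not a description of the ensemble used in the projects above; the two-word vocabulary, the counts, and the labels are all made up.

```python
import math


def sigmoid(z):
    """Squash a real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that the text vector x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error on this example drives the gradient.
            err = predict_proba(weights, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the average gradient to reduce the error.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical word counts for a made-up vocabulary ["attack", "peace"].
X_TRAIN = [[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]]
Y_TRAIN = [1, 1, 1, 0, 0, 0]  # 1 = flagged by analysts, 0 = not flagged

weights, bias = train(X_TRAIN, Y_TRAIN)
score_new = predict_proba(weights, bias, [3, 0])  # unseen text vector
```

After training, the weight on "attack" is positive and the weight on "peace" is negative, so a new text with three mentions of "attack" receives a probability above 0.5. The learning rate and number of epochs are arbitrary values that happen to work on this tiny example.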

After training, when the algorithm sees a new piece of text, it can then make an informed prediction of the category to which it belongs without any human intervention. At this stage, the computer has learned, and a new prediction can be made in near real-time, in a process which is now easily scalable to massive amounts of data.

Image: AKTEK Ltd

A robust statistical framework

In all parts of model development and training, the most important aspect is to have a well-defined statistical framework. We need to back up every model choice: which text pre-processing to do (lemmatization, stemming, stop words), which specific vectorization to use (BoW, tf-idf, embeddings), which classifier or ensemble to choose (simpler algorithms or sophisticated deep learning constructions), and how to tune all of those. This requires a continuous assessment of how well the model is expected to perform once it is deployed on new data. This model selection is usually done via what is called validation.

We can then test how well our final algorithm finds patterns in the data by showing it new data it has never seen. This new data must have been kept separate from the training set, in what is called the test set. We pass this ‘new’ data through the algorithm, and compare the algorithm’s predictions to the true outcome values obtained again from the analysts. This gives us a pretty robust estimate of how accurately our algorithm will perform when it is applied to new data in a real-world application.
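The separation between training, validation, and test data can be sketched as follows. The split fractions, the fixed random seed, and the function names are assumptions made for this illustration; in practice the split is often stratified by label or repeated as cross-validation.

```python
import random


def split_dataset(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint train, validation, test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


def accuracy(predictions, labels):
    """Fraction of predictions that match the analysts' labels."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)


# Ten labelled examples, represented here only by their index.
train_set, val_set, test_set = split_dataset(range(10))

# Made-up predictions compared against made-up ground truth labels.
test_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

The validation set is used to compare model choices, while the test set is touched only once at the end, so the accuracy measured on it is not biased by the model selection itself.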


Each of the techniques we have summarized here has a rich literature behind it. We encourage you to investigate them further; the field is certainly worth a deeper look! All of the techniques mentioned here, and more, are powering many real-world applications: search engines, translation systems, chatbots, extremism detectors.

When applied in collaboration with research experts, data science is providing answers and solutions to some of the most complex questions currently being faced in international development and security. This entry only covers a small sample of text-based solutions — in future entries, we will describe approaches with other types of data.

