
Introduction to Natural Language Inference (NLI)

What is Natural Language Inference?


Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is a key task in Natural Language Processing (NLP) that focuses on understanding the logical relationship between two sentences. Specifically, NLI involves determining whether a given hypothesis can be inferred from a premise. The task has three possible outcomes:

  • Entailment: the hypothesis logically follows from the premise.
  • Contradiction: the hypothesis is clearly false given the premise.
  • Neutral: the hypothesis can be determined neither true nor false from the premise alone.


For example, consider the premise: "A man is playing the guitar." If the hypothesis is "A man is making music," the relationship is one of entailment, because playing a guitar typically results in making music. Conversely, if the hypothesis is "The man is sleeping," this would be a contradiction, since the premise clearly describes the man performing an action that requires him to be awake. Finally, if the hypothesis is "The man is in a band," this would be neutral, as playing the guitar doesn’t necessarily imply the man is part of a band, nor does it contradict that possibility.
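These three example pairs can be checked with an off-the-shelf NLI model. Below is a minimal sketch using the publicly available roberta-large-mnli checkpoint via Hugging Face Transformers (an assumed setup for illustration, not the only option); the predicted label name is read from the model's own config rather than hard-coded.

```python
# A minimal sketch of three-way NLI classification, assuming the public
# roberta-large-mnli checkpoint from Hugging Face Transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "A man is playing the guitar."
hypotheses = [
    "A man is making music.",   # expected: entailment
    "The man is sleeping.",     # expected: contradiction
    "The man is in a band.",    # expected: neutral
]

for hypothesis in hypotheses:
    # NLI models consume the premise and hypothesis as a single paired input.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    pred = logits.argmax(dim=-1).item()
    print(f"{hypothesis} -> {model.config.id2label[pred]}")
```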

NLI is crucial because it tests a model's ability to reason about language beyond simple word matching or shallow syntactic patterns. Models need to understand not only the meanings of individual words but also how different combinations of words change the meaning of a sentence. For instance, the premise "The cat is sitting on the mat" entails the hypothesis "There is a cat on the mat," but it does not entail "The cat is sitting under the mat," even though both hypotheses share nearly all their words with the premise. Changing a single preposition reverses the spatial relationship, so a model must understand positional language rather than rely on lexical overlap.

In more complex cases, NLI involves making inferences based on context and real-world knowledge. For instance, consider the premise "John left the house early to avoid traffic." If the hypothesis is "John is concerned about being late," the relationship would be entailment, as avoiding traffic usually implies concern about arriving on time, even though the premise does not explicitly state it. This requires models to have an understanding of common human motivations and behaviors, as well as the ability to perform reasoning beyond simple text matching.

In summary, NLI tasks go beyond the surface level of understanding language and require models to capture deeper semantic and logical relationships. They are used as benchmarks to test how well systems can handle inferential reasoning, and they have broad applications in improving the robustness and accuracy of various NLP tasks like question answering, machine translation, and dialogue systems.

A brief history

RTE

The modern development of NLI began with the launch of the Recognizing Textual Entailment (RTE) challenge [1], which started in 2004. This challenge was designed to standardize the evaluation of systems that could recognize entailment relationships between two pieces of text. It shifted the focus from rigid, logic-based systems toward practical, real-world language understanding.

Organized by the PASCAL network, the first challenge provided benchmark datasets for evaluating NLI systems. The challenges revealed that natural language understanding requires capturing not just syntactic structure but also commonsense reasoning and world knowledge. RTE tasks typically presented a sentence pair and asked whether the truth of the hypothesis could be inferred from the premise.

Throughout the RTE challenges (RTE-1 through RTE-7), the field began moving toward more sophisticated machine learning methods. By the mid-2000s, these systems leveraged support vector machines (SVMs), decision trees, and other traditional machine learning classifiers trained on hand-engineered features derived from lexical overlap, WordNet-based similarity, and dependency parses. However, these approaches often struggled with the nuances of natural language, and models from this period were constrained by the difficulty of capturing semantic and world knowledge through rules or shallow statistical models.
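To make the feature-based approach concrete, here is a minimal sketch of an overlap-feature entailment classifier in scikit-learn; the features and toy training pairs are illustrative inventions, not drawn from an actual RTE dataset.

```python
# A sketch of an early-style feature-based entailment classifier:
# hand-engineered overlap features fed to a linear SVM (toy data only).
from sklearn.svm import SVC

def overlap_features(premise: str, hypothesis: str) -> list[float]:
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return [
        len(p & h) / len(h),                         # hypothesis coverage
        len(h - p) / len(h),                         # novel hypothesis words
        abs(len(p) - len(h)) / max(len(p), len(h)),  # length mismatch
    ]

# Hypothetical training pairs: 1 = entailment, 0 = no entailment.
pairs = [
    ("A man is playing the guitar.", "A man is playing an instrument.", 1),
    ("A man is playing the guitar.", "The man is sleeping.", 0),
    ("The cat is sitting on the mat.", "There is a cat on the mat.", 1),
    ("The cat is sitting on the mat.", "The dog is barking loudly.", 0),
]
X = [overlap_features(p, h) for p, h, _ in pairs]
y = [label for _, _, label in pairs]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([overlap_features("A woman reads a book.",
                                    "A woman is reading.")]))
```

Such shallow features capture word overlap, but as noted above they miss the semantic and world knowledge that entailment often hinges on.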

Deep Learning

The field of NLI saw a major breakthrough with the advent of deep learning techniques. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and convolutional neural networks (CNNs) began to outperform traditional machine learning models. These architectures could better capture semantic meaning by learning from large amounts of data in an end-to-end manner.

Two key datasets further accelerated NLI research during this time:

  • Stanford Natural Language Inference (SNLI) Corpus (2015) [2]: Introduced by Bowman et al., SNLI became a landmark dataset in NLI. It was much larger than previous RTE datasets, providing over 570,000 sentence pairs, and it enabled deep learning models to be trained and tested far more effectively. SNLI became the go-to dataset for the development of neural models.
  • Multi-Genre NLI (MultiNLI) Corpus (2018) [3]: The MultiNLI dataset, created by Williams et al., expanded on SNLI by providing roughly 433,000 sentence pairs drawn from multiple genres of written and spoken text. This aimed to better evaluate how well models generalize across different domains and styles of language.

[Figure: A sample of text–hypothesis pairs from the SNLI dataset]
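Both corpora are publicly available; a quick way to inspect such pairs is with the Hugging Face datasets library. The sketch below assumes the "snli" dataset as published on the Hub, where examples without a gold label are marked -1 and are usually filtered out.

```python
# A small sketch of loading and inspecting SNLI pairs with the Hugging Face
# `datasets` library ("snli" is the dataset's name on the Hub).
from datasets import load_dataset

snli = load_dataset("snli", split="train")
snli = snli.filter(lambda ex: ex["label"] != -1)  # drop pairs without a gold label

label_names = snli.features["label"].names  # entailment / neutral / contradiction
example = snli[0]
print("Premise:   ", example["premise"])
print("Hypothesis:", example["hypothesis"])
print("Label:     ", label_names[example["label"]])
```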

Models such as attention-based LSTMs began to dominate the leaderboards. Attention mechanisms allowed the models to focus on specific parts of the premise when processing the hypothesis, improving the ability to detect subtle patterns.
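To illustrate the mechanism, here is a minimal sketch of dot-product attention over premise encodings in PyTorch; the toy random tensors stand in for LSTM hidden states, and the function name is illustrative.

```python
# A minimal sketch of dot-product attention over premise states, in the
# spirit of attention-based LSTM models for NLI (toy tensors, PyTorch).
import torch

def attend(h_hyp: torch.Tensor, H_prem: torch.Tensor) -> torch.Tensor:
    """Summarize the premise relative to one hypothesis state.

    h_hyp:  (d,)   a hidden state from the hypothesis encoder
    H_prem: (T, d) hidden states from the premise encoder
    """
    scores = H_prem @ h_hyp                 # (T,) alignment scores
    weights = torch.softmax(scores, dim=0)  # how much each premise word matters
    return weights @ H_prem                 # (d,) attention-weighted premise summary

# Toy example: a 5-word premise and one hypothesis state, dimension 8.
H_prem = torch.randn(5, 8)
h_hyp = torch.randn(8)
print(attend(h_hyp, H_prem).shape)  # torch.Size([8])
```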

The Transformer Era

The introduction of Transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers), revolutionized NLI. BERT and similar models such as GPT, RoBERTa, and T5 shifted the paradigm by pretraining on massive amounts of text and fine-tuning on downstream tasks like NLI. These models, based on the Transformer architecture, captured deep contextual representations of language, significantly improving performance on NLI benchmarks.
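The pretrain-then-fine-tune recipe is straightforward to express in code. The sketch below fine-tunes bert-base-uncased on SNLI with Hugging Face Transformers; the checkpoint, hyperparameters, and one-epoch schedule are illustrative choices rather than a prescribed setup.

```python
# A minimal sketch of fine-tuning a BERT-style encoder on NLI sentence pairs
# with Hugging Face Transformers (hyperparameters are illustrative).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entailment / neutral / contradiction

dataset = load_dataset("snli").filter(lambda ex: ex["label"] != -1)

def encode(batch):
    # Premise and hypothesis are encoded together as one sequence pair.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

dataset = dataset.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nli-bert", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding of each batch
)
trainer.train()
```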

Key developments during this period include:

  • BERT (2018): Introduced by Devlin et al., BERT's ability to model bidirectional context led to state-of-the-art results in NLI. BERT was pretrained on vast amounts of unannotated text and fine-tuned for specific tasks like NLI.
  • GPT (Generative Pretrained Transformer) Series (2018-present): GPT, and later versions like GPT-2 and GPT-3, demonstrated that large-scale, autoregressive models could also handle NLI tasks by generating context-aware representations of text.
  • RoBERTa (2019): RoBERTa, a variation of BERT, further improved NLI performance by tweaking pretraining objectives and increasing data scale.

With these models, NLI systems reached accuracy approaching human performance on benchmarks like SNLI and MultiNLI, but challenges remain in areas such as adversarial examples, robustness to out-of-domain data, and commonsense reasoning.

NLI and TrueState

Our Platform

The field of NLI has come a long way in the past 20 years and is virtually unrecognizable compared to its humble beginnings. Transformer models have brought near-human-level performance, but with that progress come new challenges: managing the infrastructure to host and deploy large models, finding new ways to apply NLI techniques, and keeping pace with the latest developments in the field.

At TrueState, we have developed a suite of algorithms that leverage state-of-the-art NLI models across a variety of use cases. As an applied AI research company, we track the latest models and techniques and fold them into our platform, so users can focus on their own implementations and leverage modern breakthroughs without becoming experts in Natural Language Processing.

Users can apply NLI technology on the TrueState platform through our Hierarchy Classification, Universal Classification, and Tagging actions in batch flows, and through the Decision Step in live flows.

References

  1. Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL Recognising Textual Entailment Challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, & F. d'Alché-Buc (Eds.), Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment (MLCW 2005, Lecture Notes in Computer Science, Vol. 3944). Springer, Berlin, Heidelberg. https://doi.org/10.1007/11736790_9
  2. Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In L. Màrquez, C. Callison-Burch, & J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 632-642). Association for Computational Linguistics. https://aclanthology.org/D15-1075. https://doi.org/10.18653/v1/D15-1075
  3. Williams, A., Nangia, N., & Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 1112-1122). Association for Computational Linguistics. http://aclweb.org/anthology/N18-1101
  4. Laurer, M., van Atteveldt, W., Casas, A., & Welbers, K. (2024). Building efficient universal classifiers with natural language inference. arXiv preprint arXiv:2312.17543. https://arxiv.org/abs/2312.17543