"Meet in the Middle: a Unified Pre-Training and Inference Paradigm for Language Models" is an article that proposes a new pre-training paradigm for language models that uses the training data more efficiently by leveraging both the prefix and the suffix while still maintaining the autoregressive nature of LMs. A new method called "Meet in the Middle" (MIM) involves training both a forward and a backward model and encouraging them to agree, allowing each LM to benefit from the context provided by the other LM, which improves data efficiency and consistency.
Language models (LMs) have revolutionized the field of natural language processing (NLP) and are widely used for various assisted authoring tasks such as text summarization, code completion, and paraphrasing. However, pre-training LMs typically focuses on optimizing the model’s ability to predict the next token given the previous tokens, without considering subsequent tokens (suffix). This research paper proposes a novel approach to pre-training LMs that leverages both prefix and suffix information, while maintaining the autoregressive nature of LMs.
The proposed approach, called “Meet in the Middle” (MIM), involves training two LMs, one that processes tokens left-to-right and another that processes tokens right-to-left. These LMs are trained jointly on a large corpus of text using a combination of the standard language modeling loss and an agreement regularizer. The agreement regularizer encourages the two models to agree on the probability distribution over next tokens, thereby making the models more consistent and improving data efficiency.
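To make the objective concrete, here is a minimal sketch of a per-token MIM-style training loss. It assumes the two models' next-token distributions are given as token-to-probability dictionaries, and it uses total-variation distance as the agreement term; the exact regularizer, the `lam` weight, and the `mim_loss` helper itself are illustrative assumptions, not the paper's precise formulation.

```python
import math

def mim_loss(fwd_probs, bwd_probs, targets, lam=0.1):
    """Illustrative per-token MIM training loss (a sketch, not the
    paper's exact objective).

    fwd_probs[i] / bwd_probs[i]: next-token distributions
    (dicts mapping token -> probability) that the forward and
    backward models assign at position i; targets[i] is the gold
    token there. The loss is the sum of both models' negative
    log-likelihoods plus lam times an agreement penalty, here the
    total-variation distance between the two distributions.
    """
    total = 0.0
    for p, q, t in zip(fwd_probs, bwd_probs, targets):
        # standard language-modeling loss for each direction
        nll = -math.log(p[t]) - math.log(q[t])
        # agreement regularizer: TV distance over the shared vocabulary
        vocab = set(p) | set(q)
        tv = 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in vocab)
        total += nll + lam * tv
    return total / len(targets)
```

When the two distributions coincide, the agreement term vanishes and the loss reduces to the ordinary bidirectional NLL; the more the models disagree, the larger the extra penalty, which is what pushes them toward consistency during joint training.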
Once pre-training is complete, the forward model can be used as a drop-in replacement for existing autoregressive LMs, while the backward model can be used for related tasks such as infilling. The proposed inference procedure for infilling takes advantage of context from both sides and the tendency of the two models to agree, resulting in better quality and latency than the state of the art.
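The infilling idea can be sketched with toy stand-in models: the forward model extends the prefix left-to-right, the backward model extends the suffix right-to-left, and generation stops as soon as the two frontiers produce the same token, where the halves are spliced. The `fwd_step`/`bwd_step` callables and the single-token meeting test are simplifying assumptions for illustration; the paper's actual procedure and stopping criterion are more involved.

```python
def mim_infill(fwd_step, bwd_step, prefix, suffix, max_len=20):
    """Illustrative "meet in the middle" infilling sketch (not the
    paper's exact algorithm).

    fwd_step / bwd_step are hypothetical callables mapping the tokens
    generated so far to the next token. The forward side grows the
    prefix rightward, the backward side grows the suffix leftward,
    and we splice the two halves at the first token on which the
    frontiers agree.
    """
    fwd, bwd = list(prefix), list(suffix)
    for _ in range(max_len):
        fwd.append(fwd_step(fwd))      # extend prefix left-to-right
        bwd.insert(0, bwd_step(bwd))   # extend suffix right-to-left
        # meeting check: both frontiers just emitted the same token
        if fwd[-1] == bwd[0]:
            return fwd + bwd[1:]       # splice, dropping the duplicate
    return fwd + bwd                   # fallback: no meeting point found
```

Because both sides generate in parallel and can halt at the meeting point rather than producing the whole middle span from one direction, this style of inference can cut latency while using context from both sides, which is the intuition behind the quality and latency gains reported in the paper.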
The authors evaluate the effectiveness of MIM for pre-training LMs on different domains and tasks, using public code and language data to pre-train LMs of different sizes. They show that MIM outperforms other baselines in terms of both perplexity and task-specific evaluation metrics. They also conduct ablation studies to show the effectiveness of their main proposals during training and inference.
Read the paper here.