Efficient Long-Text Understanding with Short-Text Models

NLP IL F2F Meetup at Intuit

A recap of the talk on efficient long-text understanding with short-text models, which covered the challenges of processing long sequences with transformer-based language models, the SLED approach for leveraging short-text pretrained models, and the analysis of SLED’s performance and limitations.

meetup

nlp

Session Video

Efficient Long-Text Understanding with Short-Text Models

Paper

Efficient Long-Text Understanding with Short-Text Models

Abstract:

Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pre-training from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pre-training step.

Speaker

Maor Ivgi
- PhD candidate at Tel Aviv University
- Maor is an NLP researcher and entrepreneur. He has vast experience in implementing state-of-the-art deep learning models for real-world use cases. He received his master’s in Computer Science at Tel Aviv University, advised by Prof. Jonathan Berant, focusing on the robustness of NLP models. As a Ph.D. candidate at Prof. Berant’s lab, his research is focused on long-range reasoning in large language models.

Slides

Efficient Long-Text Understanding with Short-Text Models

NLP seems to have reached a new level of maturity for use in industry
- cf. Attention Is All You Need
- cf. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Transformers’ quadratic dependency limits

Transformers have issues with long texts:
- Self-attention is O(n^2) in the input length n.
- Cross-attention is O(nk), where k is the output length.
- Efficient LLM papers are:
  - hard to understand,
  - hard to generalize (due to platform-specific engineering tricks),
  - expensive to reproduce.
- Memory becomes an issue at inference time.
- Training often uses only the beginning of a document, so the model does not see the end.
- Self-attention has a limited window size.
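To make the two complexity terms concrete, here is a minimal Python sketch that counts the token pairs each attention type has to score. It assumes n is the input length and k is the output length (my reading of the slide), and the function names and example lengths are illustrative, not from the talk.

```python
def self_attention_pairs(n: int) -> int:
    """Token pairs scored by full self-attention over n input tokens: n * n."""
    return n * n


def cross_attention_pairs(n: int, k: int) -> int:
    """Token pairs scored when each of k output tokens attends to n input tokens."""
    return n * k


# Illustrative lengths only; they are not taken from the talk.
for n in (1_024, 2_048, 4_096):
    print(n, self_attention_pairs(n), cross_attention_pairs(n, k=128))
```

Doubling the input length quadruples the self-attention count but only doubles the cross-attention count, which is why the encoder side is the first thing to break on long documents.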

SLED’s Approach
- Assume locality of information: “In an encoder-decoder architecture, the encoder can effectively contextualize input tokens with local context only, leaving long range dependency to be handled by the decoder.”
- Split the text into short, fixed-length, overlapping chunks (short contexts).
- Prepend the prefix/prompt to each chunk
- The decoder will need to put it all together.
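The chunking steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it works on a plain list of tokens, the helper names `split_into_chunks` and `build_encoder_inputs` are hypothetical, and the chunk length and overlap are arbitrary values chosen for the example.

```python
from typing import List


def split_into_chunks(tokens: List[str], chunk_len: int, overlap: int) -> List[List[str]]:
    """Split tokens into overlapping windows of at most chunk_len tokens."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    if not tokens:
        return []
    stride = chunk_len - overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


def build_encoder_inputs(prefix: List[str], tokens: List[str],
                         chunk_len: int, overlap: int) -> List[List[str]]:
    """Prepend the same prefix (e.g. a question) to every chunk."""
    return [prefix + chunk for chunk in split_into_chunks(tokens, chunk_len, overlap)]


# Toy document of 10 tokens and a toy prefix (both made up for illustration).
doc_tokens = [f"t{i}" for i in range(10)]
prefix = ["question:", "who", "?"]
for encoder_input in build_encoder_inputs(prefix, doc_tokens, chunk_len=4, overlap=2):
    print(encoder_input)
```

Each printed list would be encoded independently by the short-text encoder. Fusing the encoded chunks is left to the decoder, and that step is not shown here.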

SLED is Competitive with short text models

This is a great slide!
It summarizes lots of info.

SLED’s Analysis
- Contextual encoding is crucial
- Cheating is not enough
- There is real benefit in fusion

What is cheating?

Quantifying SLED’s benefits using relative improvement.

$$
\text{Relative Improvement} = \frac{\mathrm{Score}(\text{SLED}) - \mathrm{Score}(\text{BART})}{\mathrm{Score}(\text{BART})}
$$
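As a quick worked example of the formula, the following Python sketch computes the relative improvement for a pair of scores. The scores are hypothetical placeholders, not results from the paper or the talk, and the function name is my own.

```python
def relative_improvement(score_sled: float, score_bart: float) -> float:
    """(Score(SLED) - Score(BART)) / Score(BART), with BART as the baseline."""
    if score_bart == 0:
        raise ValueError("baseline score must be non-zero")
    return (score_sled - score_bart) / score_bart


# Hypothetical scores, chosen only to illustrate the arithmetic.
gain = relative_improvement(score_sled=33.0, score_bart=30.0)
print(f"{gain:.1%}")  # 10.0%
```

Normalizing by the baseline score makes gains comparable across tasks whose metrics live on different scales.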

Limits & Future Work
- Long outputs are still a constraint
- No explicit global contextualization
- No explicit global positional information
- Not applicable to decoder-only architectures
- (Corrective) pre-training is expected to help

Takeaways
- Individual pieces of information are localized
- Fusion in the decoder works
- SLED does well on long-range tasks.

Main points

The speaker points out that the encoder can usually do an adequate job of understanding the input by looking at local context only, mostly a window of a few surrounding sentences. SLED uses this to encode the input into a compact representation, which we call the state. The decoder then leverages these “adequate” encodings to efficiently retrieve results from much longer contexts during inference on different tasks.

Citation

BibTeX citation:

@online{bochman2015,
  author = {Bochman, Oren},
  title = {Efficient {Long-Text} {Understanding} with {Short-Text}
    {Models}},
  date = {2015-11-01},
  url = {https://orenbochman.github.io/posts/2023/01-11-nlp-il-meetup-intuit/talk2.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2015. “Efficient Long-Text Understanding with Short-Text Models.” November 1. https://orenbochman.github.io/posts/2023/01-11-nlp-il-meetup-intuit/talk2.html.