Efficient Long-Text Understanding with Short-Text Models

NLP IL F2F Meetup at Intuit

A recap of the talk on efficient long-text understanding with short-text models, which covered the challenges of processing long sequences with transformer-based language models, the SLED approach for leveraging short-text pretrained models, and the analysis of SLED’s performance and limitations.
meetup
nlp
Author

Oren Bochman

Published

Sunday, November 1, 2015

Modified

Monday, May 18, 2026

Keywords

NLP, Intuit, Meetup, Long-Range Reasoning, Efficient Long-Text Understanding, Speech Recognition, SCROLLS, SLED

Session Video

Efficient Long-Text Understanding with Short-Text Models

Paper

Abstract:

Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pre-training from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pre-training step.

Speaker

  • Maor Ivgi
    • PhD candidate at Tel Aviv University.
    • Maor is an NLP researcher and entrepreneur. He has vast experience in implementing state-of-the-art deep learning models for real-world use cases. He received his master’s in Computer Science at Tel Aviv University, advised by Prof. Jonathan Berant, focusing on the robustness of NLP models. As a Ph.D. candidate at Prof. Berant’s lab, his research is focused on long-range reasoning in large language models.

Slides

Efficient Long-Text Understanding with Short-Text Models

NLP Papers

  • NLP seems to have reached a new level of maturity for use in industry
    • cf. Attention Is All You Need
    • cf. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Model Timeline

Q&A challenges

Transformers - Good on short text NLU

Long Text NLU Fail

Transformers Quadratic dependency limits

Transformers Attention complexity

  • Transformers have issues with long texts:
    • Self-attention is O(n^2)
    • Cross-attention is O(nk)

Novel Transformer Architecture Papers
  • Efficient LLM papers are:
    • Hard to understand
    • Hard to generalize (due to platform-specific engineering tricks)
    • Expensive to reproduce
  • Memory is an issue at inference time
  • Training often uses only the beginning of the document, so the model does not see the end
  • Self-attention has a limited window size.
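To make the O(n^2) and O(nk) costs above concrete, here is a minimal sketch that simply counts attention score entries. It assumes n is the number of input tokens and k the number of output tokens (the notes do not define them), and the function names are made up for illustration.

```python
def self_attention_pairs(n: int) -> int:
    """Query-key score entries for full self-attention over n tokens."""
    return n * n


def cross_attention_pairs(n: int, k: int) -> int:
    """Score entries when k decoder tokens each attend to n encoder tokens."""
    return n * k


if __name__ == "__main__":
    k = 128  # assumed output length, for illustration only
    for n in (512, 4096, 16384):
        print(n, self_attention_pairs(n), cross_attention_pairs(n, k))
```

Doubling n quadruples the self-attention count, while the cross-attention count only doubles for a fixed k. That is why the encoder side is the bottleneck for long inputs.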

SLED - Locality

SLED - Properties

  • SLED’s Approach
    • Assume locality of information: “In an encoder-decoder architecture, the encoder can effectively contextualize input tokens with local context only, leaving long range dependency to be handled by the decoder.”
    • Split the text into short, fixed-length, overlapping chunks (short contexts).
    • Prepend the prefix/prompt to each chunk
    • The decoder will need to put it all together.
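The chunking steps listed above can be sketched roughly as follows. This is a simplified illustration, not the SLED authors' implementation: the function name `make_chunks`, the use of plain token lists, and the default `chunk_len` and `overlap` values are all assumptions made for the example.

```python
from typing import List


def make_chunks(doc_tokens: List[str], prefix_tokens: List[str],
                chunk_len: int = 8, overlap: int = 2) -> List[List[str]]:
    """Split doc_tokens into overlapping windows and prepend the prefix to each.

    Assumption: consecutive windows share `overlap` tokens; the last window
    may be shorter than chunk_len if the document does not divide evenly.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    chunks: List[List[str]] = []
    start = 0
    while True:
        window = doc_tokens[start:start + chunk_len]
        chunks.append(prefix_tokens + window)  # the prompt is repeated per chunk
        if start + chunk_len >= len(doc_tokens):
            break
        start += stride
    return chunks


if __name__ == "__main__":
    doc = [f"t{i}" for i in range(20)]
    for chunk in make_chunks(doc, ["Q:"], chunk_len=8, overlap=2):
        print(chunk)
```

Each chunk stays within what a short-text encoder can handle, and the repeated prefix lets every chunk be encoded with the question or instruction in view.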

SLED Properties

Model Size effect

SLED Performance Boost

SLED is Competitive with short text models

Analysis

  • This is a great slide!
  • It summarizes lots of info
  • SLED’s Analysis
    • Contextual encoding is crucial
    • Cheating is not enough
    • There is real benefit in fusion

Finding a Needle in a Haystack

Finding a Needle perfectly

Fusing Information Pieces

  • What is cheating?

Cheating is not enough

  • Quantifying SLED’s benefits using relative improvement.

\text{Relative Improvement} = \frac{\text{Score}(\text{SLED}) - \text{Score}(\text{BART})}{\text{Score}(\text{BART})}
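As a small sketch of how this metric could be computed, assuming both scores are on the same scale and the BART baseline score is non-zero. The function name and the example numbers are hypothetical and are not results from the talk or the paper.

```python
def relative_improvement(score_sled: float, score_bart: float) -> float:
    """(Score(SLED) - Score(BART)) / Score(BART), as in the formula above."""
    if score_bart == 0:
        raise ValueError("baseline score must be non-zero")
    return (score_sled - score_bart) / score_bart


if __name__ == "__main__":
    # Hypothetical scores, chosen only to illustrate the arithmetic.
    print(relative_improvement(33.0, 30.0))
```

With the hypothetical scores 33.0 and 30.0 this gives 0.1, i.e. a 10% gain over the baseline.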

Formula for gains from longer inputs

Chart of gains from longer inputs

Limitations & Future Work

  • Limits & Future Work
    • Long outputs are still a constraint
    • No explicit global contextualization
    • No explicit global positional information
    • Not applicable to decoder-only architectures
    • (Corrective) pre-training is expected to help

Takeaways

  • Takeaways
    • Individual pieces of information are localized
    • Fusion in the decoder works
    • SLED does well on long-range tasks.

Questions

  • Main points

They point out that the encoder can usually do an adequate job of understanding the input by looking at local context only, mostly a window of a few surrounding sentences. It uses this to encode the input into a compact representation we call the state. The decoder then leverages this compression, with its “adequate” encodings, to efficiently retrieve results from much longer contexts during inference on different tasks.
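The division of labour described above, local encoding followed by global fusion in the decoder, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: tokens are plain floats, `encode_chunk` is a stand-in "encoder" that only mixes in the mean of its own chunk, overlapping tokens are kept from the earlier chunk, and `decoder_attend` is a single softmax-weighted read over the whole state. It is not the SLED implementation.

```python
import math
from typing import List, Sequence


def encode_chunk(chunk: Sequence[float]) -> List[float]:
    """Toy encoder: each token sees only its own chunk (local context)."""
    mean = sum(chunk) / len(chunk)
    return [0.5 * x + 0.5 * mean for x in chunk]


def sled_style_encode(tokens: Sequence[float], chunk_len: int,
                      overlap: int) -> List[float]:
    """Encode overlapping chunks independently; keep one encoding per token."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    state: List[float] = []
    start, covered = 0, 0  # covered = tokens that already have an encoding
    while covered < len(tokens):
        chunk = tokens[start:start + chunk_len]
        encoded = encode_chunk(chunk)
        state.extend(encoded[covered - start:])  # skip already-covered overlap
        covered = start + len(chunk)
        start += stride
    return state


def decoder_attend(query: float, state: Sequence[float]) -> float:
    """Toy cross-attention: one query reads from every encoded token."""
    scores = [query * s for s in state]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, state)) / total


if __name__ == "__main__":
    tokens = [float(i) for i in range(20)]
    state = sled_style_encode(tokens, chunk_len=8, overlap=2)
    print(len(state), decoder_attend(0.1, state))
```

The encoder never looks beyond a single chunk, while the decoder's read spans the full concatenated state. Long-range fusion therefore happens only on the decoder side.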

Citation

BibTeX citation:
@online{bochman2015,
  author = {Bochman, Oren},
  title = {Efficient {Long-Text} {Understanding} with {Short-Text}
    {Models}},
  date = {2015-11-01},
  url = {https://orenbochman.github.io/posts/2023/01-11-nlp-il-meetup-intuit/talk2.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2015. “Efficient Long-Text Understanding with Short-Text Models.” November 1. https://orenbochman.github.io/posts/2023/01-11-nlp-il-meetup-intuit/talk2.html.